MASSACHUSETTS INSTITUTE OF TECHNOLOGY 
ARTIFICIAL INTELLIGENCE LABORATORY 


A.I. Technical Report No. 1445 


September, 1993 


Robust, High-Speed Network Design for Large-Scale Multiprocessing 


Andre DeHon 

andre@ai.mit.edu 


Abstract: Large-scale multiprocessing remains an elusive, yet promising paradigm for achieving 
high-performance computation. As machine size scales upward, there are two important aspects 
of multiprocessor systems which will generally get worse rather than better: (1) interprocessor 
communication latency will increase and (2) the probability that some component in the system 
will fail will increase. Both of these problems can prevent us from realizing the potential benefits of 
large-scale multiprocessing. In this document we consider the problem of designing networks which 
simultaneously minimize communication latency while maximizing fault tolerance for large-scale 
multiprocessors. Using a synergy of techniques including connection topologies, routing protocols, 
signalling techniques, and packaging technologies we assemble integrated, system-level solutions 
to this network design problem. In particular, we recommend the use of multipath, multistage 
networks, simple, source-responsible routing protocols, stochastic fault-avoidance, dense three-dimensional
packaging, low-voltage, series-terminated transmission line signalling, and scan-based
diagnostics and reconfiguration. 

Acknowledgements: This report describes research done at the Artificial Intelligence Laboratory of the Massachusetts
Institute of Technology. Support for the Laboratory's Artificial Intelligence Research is provided in part by the 
Advanced Research Projects Agency under Office of Naval Research contract N00014-91-J-1698. This material is based 
upon work supported under a National Science Foundation Graduate Fellowship. Any opinions, findings, conclusions 
or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the 
National Science Foundation. 




Acknowledgments 


The ideas presented herein are the collaborative product of the MIT Transit Project under the 
direction of Dr. Thomas Knight, Jr. The viewpoint is my own and, as such, the description reflects 
my perspective and biases on the subject matter. Nonetheless, the whole picture presented here 
was shaped by the effort of the group and would definitely have been less complete without such a 
collective effort. 

At the expense of slighting some whose influences and efforts I may not yet fully appreciate, 
I feel it only appropriate to acknowledge many by name. Knight provided many of the seed ideas 
which have formed the basis of our works. Preliminary work with the MIT Connection Machine 
Project and with the Symbolics VLSI group raised many of the issues and ideas which the Transit 
Project inherited. Collaboration with Tom Leighton, Bruce Maggs, and Charles Leiserson on 
interconnection topologies has also been instrumental in developing many of the ideas presented 
here. 

Knight, of course, provided the project with the oversight and encouragement to make this kind 
of effort possible. Henry Minsky has been a key player in the Transit effort since its inception. 
Minsky has helped shape ideas on topology, routing, and packaging. Without his VLSI efforts, 
RN1 would not have existed and all of us would have suffered greatly on other VLSI projects. 
Along with Knight, Alex Ishii and Thomas Simon have been invaluable in the effort to understand 
and develop high-performance signalling strategies. While Simon has a very different viewpoint 
than myself on most issues, the alternative perspective has generally been healthy and produced 
greater understanding. The packaging effort would remain arrested in the conceptual stage without 
the efforts of Fred Drenckhahn. Frederic Chong and Eran Egozy provided the cornerstone for our 
network organization evaluations and developments. Their efforts took vague notions and turned 
them into quantifiable entities which allowed us greater insight. Eran Egozy spearheaded the 
original METRO effort which also included Minsky, Simon, Samuel Peretz, and Matthew Becker. 
Simon helped shape the MBTA effort. The MBTA effort has also benefitted from contributions by 
Minsky, Timothy Kutscha, David Warren, and Ian Eslick. Patrick Sobalvarro, Michael Bolotski, 
Neil Brock, and Saed Younis have also offered notable help during the course of this effort. 

This work has also been shaped by numerous discussions with people in “rival” research groups 
here at MIT. Notably, discussions with Anant Agarwal, John Kubiatowicz, David Chaiken, and 
Kirk Johnson in the MIT Alewife Project have been quite useful. William Dally, Michael Noakes, 
Ellen Spertus, Larry Dennison, and Deborah Wallach of the Concurrent VLSI Architecture group 
have all provided many useful comments and feedback. Noakes and George Andy Boughton have 
provided much valuable technical support during the life of this project. I am also indebted to 
William Weihl, Gregory Papadopoulos, and Donald Troxel for their direction. 

Our productivity during this effort has been enhanced by the availability of software in source 
form. This has allowed us to customize existing tools to suit our novel needs and correct deficiencies. 
Notably, we have benefitted from software from the GNU Project, the Berkeley CAD group, and 
the Active Messages group at Berkeley. 

We have also been fortunate that several companies offer generous university programs through 
which they have made tools available to us. Intel has been most helpful in providing a rich suite of 
tools for working with the 80960 microprocessor series in source form. Actel provided sufficient 
discounts on their FPGA tools to make their FPGAs useful for our prototyping efforts. Exemplar has 
also provided generous discounts on their synthesis tools. Cadence provided us with Verilog-XL 
for simulation. Logic Modeling Corporation has made their library of models for Verilog available. 
MetaSoftware provided us with HSpice. Many of our efforts could not have succeeded without the 
support of these industrial strength tools. 

This research is supported in part by the Defense Advanced Research Projects Agency under 
contracts N00014-87-K-0825 and N00014-91-J-1698. This material is based upon work supported 
under a National Science Foundation Graduate Fellowship. Any opinions, findings, conclusions 
or recommendations expressed in this publication are those of the author and do not necessarily 
reflect the views of the National Science Foundation. 

Fabrication of printed-circuit boards and integrated circuits was provided by the MOSIS Project. 
We are greatly indebted to the careful handling and attention they have given to our projects. 
Particularly, we are grateful for the help provided by Terry Dosek, Wes Hansford, Sam DelaTorre, 
and Sam Reynolds. 



Contents 


I Introduction and Background 1 

1 Introduction 2 

1.1 Goals. 3 

1.2 Scope . 3 

1.3 Overview . 3 

1.3.1 Topology. 4 

1.3.2 Routing. 6 

1.3.3 Technology . 7 

1.3.4 Fault Management. 8 

1.4 Organization. 8 

2 Background 10 

2.1 Models. 10 

2.1.1 Fault Model. 10 

2.1.2 Multiprocessor Model. 11 

2.2 IEEE-1149.1-1990 TAP. 11 

2.3 Effects of Latency. 13 

2.4 Latency Issues. 14 

2.4.1 Network Latency . 14 

2.4.2 Locality. 16 

2.4.3 Node Handling Latency. 17 

2.5 Faults in Large-Systems. 18 

2.6 Fault Tolerance. 19 

2.7 Pragmatic Considerations. 19 

2.7.1 Physical Constraints. 19 

2.7.2 Design Complexity . 20 

2.8 Flexibility Concerns. 20 

II Engineering Reliable, Low-Latency Networks 22 

3 Network Organization 23 

3.1 Low-Latency Networks. 23 

3.1.1 Fully Connected Network. 23 

3.1.2 Full Crossbar . 23 

3.1.3 Hypercube. 24 

3.1.4 k-ary-n-cube. 25 

3.1.5 Flat Multistage Networks. 27 

3.1.6 Tree Based Networks. 27 

3.1.7 Express Cubes. 29 

3.1.8 Summary . 29 

3.2 Wire Length. 29 

3.3 Fault Tolerance. 31 

3.3.1 Indirect Routing. 31 

3.3.2 k-ary-n-cubes and Express Cubes. 31 

3.3.3 Multiple Networks. 31 

3.3.4 Extra-Stage, Multistage Networks. 32 

3.3.5 Interwired, Multipath, Multistage Networks. 33 

3.4 Robust Networks for Low-Latency Communications. 34 

3.5 Network Design. 34 

3.5.1 Parameters in Network Construction. 35 

3.5.2 Endpoints. 35 

3.5.3 Internal Wiring. 36 

3.5.4 Network Yield Evaluation. 42 

3.5.5 Network Harvest Evaluation. 48 

3.5.6 Trees. 48 

3.5.7 Hybrid Fat-Tree Networks. 50 

3.6 Flexibility. 51 

3.7 Summary. 53 

3.8 Areas to Explore. 54 

4 Routing Protocol 55 

4.1 Problem Statement. 55 

4.1.1 Low-overhead Routing . 55 

4.1.2 Flexibility . 56 

4.1.3 Distributed Routing. 56 

4.1.4 Dynamic Fault Tolerance . 56 

4.1.5 Fault Identification. 56 

4.2 Protocol Overview. 56 

4.3 MRP in the Context of the ISO OSI Reference Model . 57 

4.4 Terminology. 57 

4.5 Basic Router Protocol. 60 

4.5.1 Signalling. 60 

4.5.2 Connection States. 60 

4.5.3 Router Behavior. 61 

4.5.4 Making Connections. 63 

4.6 Network Routing. 63 


4.7 Basic Endpoint Protocol. 64 

4.7.1 Initiating a Connection. 64 

4.7.2 Return Data from Network. 65 

4.7.3 Retransmission . 66 

4.7.4 Receiving Data from Network. 67 

4.7.5 Idempotence. 67 

4.8 Composite Behavior and Examples. 68 

4.8.1 Composite Protocol Review. 69 

4.8.2 Examples . 69 

4.9 Architectural Enhancements. 73 

4.9.1 Avoiding Known Faults. 73 

4.9.2 Back Drop. 75 

4.10 Performance. 77 

4.11 Pragmatic Variants. 78 

4.11.1 Pipelining Data Through Routers. 78 

4.11.2 Pipelined Connection Setup. 81 

4.11.3 Pipelining Bits on Wires. 81 

4.12 Width Cascading. 83 

4.12.1 Width Cascading Problem. 84 

4.12.2 Techniques. 85 

4.12.3 Costs and Implementation Issues . 87 

4.12.4 Flexibility Benefits. 87 

4.13 Protocol Features. 87 

4.13.1 Overhead . 88 

4.13.2 Flexibility. 88 

4.13.3 Distributed Routing. 88 

4.13.4 Fault Tolerance . 88 

4.13.5 Fault Identification and Localization. 89 

4.14 Summary. 89 

4.15 Areas to Explore. 89 

5 Test and Reconfiguration 91 

5.1 Dealing with Faults . 91 

5.2 Scan-Based Testing and Reconfiguration. 92 

5.3 Robust and Fine-Grained Scan Techniques. 92 

5.3.1 Multi-TAP. 93 

5.3.2 Port-by-Port Selection. 96 

5.3.3 Partial-External Scan . 97 

5.4 Fault Identification. 97 

5.5 Reconfiguration. 98 

5.5.1 Fault Masking. 98 

5.5.2 Propagating Reconfiguration. 98 

5.5.3 Internal Router Sparing. 99 

5.6 On-Line Repair . 99 


5.7 High-Level Fault and Repair Management. 102 

5.8 Summary. 103 

5.9 Areas To Explore. 103 

6 Signalling Technology 105 

6.1 Signalling Problem. 105 

6.2 Transmission Line Review. 105 

6.3 Issues in Transmission Line Signalling. 109 

6.4 Basic Signalling Strategy . 112 

6.5 Driver. 114 

6.6 Receiver. 116 

6.7 Bidirectional Operation. 118 

6.8 Automatic Impedance Control. 119 

6.8.1 Circuitry. 119 

6.8.2 Impedance Selection Problem. 120 

6.8.3 Impedance Selection Algorithm. 123 

6.8.4 Register Sizes. 126 

6.8.5 Sample Results . 126 

6.8.6 Sharing . 127 

6.8.7 Temperature Variation. 128 

6.9 Matched Delay. 129 

6.9.1 Problem. 130 

6.9.2 Adjustable Delay Pads. 130 

6.9.3 Delay Adjustment. 131 

6.9.4 Simulating Long Sample Registers . 133 

6.10 Summary. 134 

6.11 Areas to Explore. 134 

7 Packaging Technology 137 

7.1 Packaging Requirements. 137 

7.2 Packing and Interconnect Technology Review. 137 

7.2.1 Integrated Circuit Packaging. 137 

7.2.2 Printed-Circuit Boards. 138 

7.2.3 Multiple PCB Systems. 139 

7.2.4 Connectors. 140 

7.3 Stack Packaging Strategy . 141 

7.3.1 Dual-Sided Pad-Grid Arrays. 141 

7.3.2 Compressional Board-to-Package Connectors. 142 

7.3.3 Printed Circuit Boards. 145 

7.3.4 Assembly. 147 

7.3.5 Cooling. 147 

7.3.6 Repair. 147 

7.3.7 Clocking. 148 

7.3.8 Stack Packaging of Non-DSPGA Components . 149 


7.4 Network Packaging Example . 151 

7.5 Packaging Large Systems. 151 

7.5.1 Single Stack Limitations. 151 

7.5.2 Large-Scale Packaging Goals. 152 

7.5.3 Fat-tree Building Blocks. 153 

7.5.4 Unit Tree Examples. 154 

7.5.5 Hollow Cube. 155 

7.5.6 Wiring Hollow Cubes. 156 

7.5.7 Hollow Cube Support. 157 

7.5.8 Hollow Cube Limitations. 159 

7.6 Multi-Chip Modules Prospects. 159 

7.7 Summary. 160 

7.8 Areas To Explore. 160 

III Case Studies 161 

8 RN1 162 

9 Metro 165 

9.1 METRO Architectural Options . 165 

9.2 METRO Technology Projections . 165 

10 Modular Bootstrapping Transit Architecture (MBTA) 167 

10.1 Architecture. 167 

10.2 Performance. 167 

11 Metro Link 171 

11.1 MLINK Function. 171 

11.2 Interfaces. 172 

11.3 Primitive Network Operations. 172 

12 MBTA Packaging 175 

12.1 Network Packaging . 175 

12.2 Node Packaging. 175 

12.3 Signal Connectivity. 175 

12.4 Assembled Stack. 179 

IV Conclusion 182 

13 Summary and Conclusion 183 

13.1 Latency Review. 183 

13.2 Fault Tolerance Review. 185 

13.3 Integrated Solutions. 186 

A Performance Simulations (by Frederic Chong) 187 

A.1 The Simulated Architecture. 187 

A.2 Coping with Network I/O . 187 

A.3 Network Loading . 188 

A.3.1 Modeling Shared-Memory Applications. 188 

A.3.2 Application Descriptions. 188 

A.3.3 Application Data. 189 

A.3.4 Synchronization. 189 

A.3.5 The flat24 Load . 189 

A.4 Performance Results for Applications. 192 


List of Figures 


1.1 16 X 16 Multibutterfly Network. 4 

1.2 Area-Universal Fat-Tree with Constant Size Switches. 5 

1.3 Cross-Section of Stack Packaging. 7 

2.1 Multiprocessor Model. 12 

2.2 Standard IEEE TAP and Scan Architecture . 12 

3.1 Fully Connected Networks. 23 

3.2 Full 16 X 16 Crossbar. 24 

3.3 Distributed 16 X 16 Crossbar . 24 

3.4 Hypercube. 25 

3.5 Mesh - k-ary-n-cube with k = 2. 25 

3.6 Cube - k-ary-n-cube with k = 3 . 26 

3.7 Torus - k-ary-n-cube with k = 2 and Wrap-Around Torus Connections. 26 

3.8 16x16 Omega Network Constructed from 2x2 Crossbars. 27 

3.9 16 X 16 Bidelta Network. 28 

3.10 Benes Network. 28 

3.11 16 X 16 Multibutterfly Network. 29 

3.12 Express Cube Network -k = 2. 30 

3.13 Replicated Multistage Network . 32 

3.14 Extra Stage Network. 33 

3.15 4x2 Crossbar with a dilation of 2. 34 

3.16 16x16 Multibutterfly Network with Radix-4 Routers in Final Stage. 36 

3.17 Left: Non-expansive Wiring of Processors to First Stage Routing Elements .... 38 

3.18 Right: Expansive Wiring of Processors to First Stage Routing Elements. 38 

3.19 Pseudo-code for Deterministic Interwiring. 40 

3.20 16 X 16 Path Expansion Multibutterfly Network. 40 

3.21 Pseudo-code for Random Interwiring. 41 

3.22 Randomly-Interwired Network. 42 

3.23 Randomized Maximal-Fanout. 43 

3.24 Pseudo-code for Random, Maximal-Fanout Interwiring. 44 

3.25 16 X 16 Randomized, Maximal-Fanout Network . 45 

3.26 Completeness of (A) 3-stage and (B) 4-stage Multipath Networks. 45 

3.27 Comparative Performance of 3-Stage and 4-Stage Networks. 47 


3.28 Chong’s Fault-Propagation Algorithm for Reconfiguration . 48 

3.29 Fault-Propagation Node Loss and Performance for 1024-Node Systems. 49 

3.30 Cross-Sectional View of Up Routing Tree and Crossover. 51 

3.31 Connections in Down Routing Stages (left). 52 

3.32 Up Routing Stage Connections with Lateral Crossovers (right) . 52 

3.33 Multibutterfly Style Cluster at Leaves of Fat-Tree. 52 

4.1 METRO Routing Protocol in the context of the ISO OSI Reference Model. 58 

4.2 Basic Router Configuration . 59 

4.3 MRP-ROUTER Connection States. 61 

4.4 16 X 16 Multibutterfly Network. 69 

4.5 Successful Route through Network . 70 

4.6 Connection Blocked in Network. 71 

4.7 Dropping a Network Connection. 71 

4.8 Reversing an Open Network Connection. 72 

4.9 Reversing a Blocked Network Connection. 73 

4.10 Reverse Connection Turn . 74 

4.11 Blocked Paths in a Multibutterfly Network. 75 

4.12 Example of Fast Path Reclamation. 76 

4.13 Backward Reclamation of Connection Stuck Open . 78 

4.14 Example Connection Open with Pipelined Routers . 79 

4.15 Example Turn with Pipelined Routers. 80 

4.16 Example of Pipelined Connection Setup. 82 

4.17 Example Turn with Wire Pipelining. 83 

4.18 Cascaded Router Configuration using Four Routing Elements. 86 

5.1 Mesh of Gridded Scan Paths. 93 

5.2 Scan Architecture for Dual-TAP Component. 95 

5.3 Propagating Reconfiguration Example. 100 

5.4 Propagating Reconfiguration Example. 101 

6.1 Initial Transmission Line Voltage Profile. 107 

6.2 Transmission Line Voltage: Open Circuit Reflection. 107 

6.3 Transmission Line Voltage: Z_term > Z_0 Reflection. 108 

6.4 Transmission Line Voltage: Matched Termination. 108 

6.5 Transmission Line Voltage: Z_term < Z_0 Reflection. 109 

6.6 Transmission Line Voltage: Short Circuit Reflection. 109 

6.7 Parallel Terminated Transmission Line. 110 

6.8 Serial Terminated Transmission Line . 111 

6.9 CMOS Transmission Line Driver. 113 

6.10 Functional View of Controlled Output Impedance Driver. 114 

6.11 CMOS Driver with Voltage Controlled Output Impedance . 115 

6.12 CMOS Driver with Digitally Controlled Output Impedance. 116 

6.13 CMOS Driver with Separate Impedance and Logic Controls. 117 


6.14 Controlled Impedance Driver Implementation. 118 

6.15 CMOS Low-voltage Differential Receiver Circuitry. 119 

6.16 CMOS Low-voltage, Differential Receiver Implementation. 120 

6.17 Bidirectional Pad Scan Architecture. 121 

6.18 Sample Register. 121 

6.19 Driver and Receiver Configuration for Bidirectional Pad . 122 

6.20 Ideal Source Transition. 123 

6.21 More Realistic Source Transitions. 125 

6.22 Impedance Selection Algorithm (Outer Loop). 126 

6.23 Impedance Selection Algorithm (Inner Loop). 127 

6.24 Impedance Matching: 6 Control Bits. 128 

6.25 Impedance Matching: 3 Control Bits. 129 

6.26 1000 Impedance Matching: 6 Control Bits. 130 

6.27 Multiplexor Based Variable Delay Buffer. 131 

6.28 Voltage Controlled Variable Delay Buffer. 131 

6.29 Adjustable Delay Bidirectional Pad Scan Architecture. 132 

6.30 Sample Register with Selectable Clock Input . 133 

6.31 Sample Register with Recycle Option. 135 

6.32 Sample Register with Overlapped Recycle. 135 

7.1 Stack Structure for Three-dimensional Packaging. 141 

7.2 DSPGA372 . 143 

7.3 DSPGA372 Photos. 144 

7.4 BB372 . 145 

7.5 Button. 146 

7.6 Cross-section of Routing Stack . 148 

7.7 Close-up Cross-section of Mated BB372 and DSPGA372 Components. 149 

7.8 Sample Clock Fanout on Horizontal PCB. 150 

7.9 Mapping of Network Logical Structure onto Physical Stack Packaging. 152 

7.10 Two Level Hollow-Cube Geometry. 156 

7.11 Two Level Hollow Cube with Top and Side Stacks of Different Sizes. 157 

7.12 Three Level Hollow Cube. 158 

8.1 RN1 Logical Configurations. 162 

8.2 RN1 Micro-architecture. 163 

8.3 Packaged RN1 IC . 164 

10.1 MBTA Routing Network. 168 

10.2 MBTA Node Architecture. 169 

11.1 MLINK Message Formats. 173 

12.1 Routing Board Arrangement for 64-processor Machine. 176 

12.2 Packaged MBTA Node. 177 

12.3 Layer of Packaged Nodes. 178 


12.4 Exploded Side View of 64-processor Machine Stack. 180 

12.5 Side View of 64-processor Machine Stack. 181 

A.1 Applications on 3-stage Random Networks. 193 

A.2 Applications on the 3-stage Deterministic Network. 194 

A.3 Comparative Performance of 3-Stage Networks. 195 

A.4 Comparative Performance of 4-Stage Networks. 196 


List of Tables 


3.1 Network Comparison . 30 

3.2 Network Construction Parameters. 35 

3.3 Connections into Each Stage. 37 

3.4 Fault Tolerance of Multipath Networks . 46 

4.1 Control Word Encodings. 60 

6.1 Representative Sample Register Data. 124 

7.1 DSPGA372 Physical Dimensions. 144 

7.2 BB372 Physical Dimensions. 145 

7.3 Unit Tree Parameters. 154 

7.4 Unit Tree Component Summary. 154 

9.1 METRO Architectural Variables. 166 

9.2 Metro Router Configuration Options. 166 

A.1 Relative Transaction Frequencies for Shared-Memory Applications. 190 

A.2 Split Phase Transactions and Grain Sizes for Shared-Memory Applications .... 191 

A.3 Message Lengths for Shared-Memory Applications. 191 


Part I 

Introduction and Background 


1. Introduction 


The high capabilities and low costs of modern microprocessors have made it attractive from both 
economic and performance viewpoints to design and construct large-scale multiprocessors based 
on commodity processor technologies. Nonetheless, many challenges remain to effectively realize 
the potential performance promised by large-scale multiprocessing on a wide range of applications. 
One key challenge is to provide sufficient inter-processor communication performance to allow 
efficient multiprocessor operation - and to provide such performance at a reasonable cost. 

In order for processors to work effectively together in a computation, they must be able to 
communicate data with each other in a timely fashion. The exact nature and role of communication 
varies with the particular programming model, but the need is pervasive. Virtually all paradigms for 
parallel processing depend critically on low communication latency to effectively exploit parallel 
execution to reduce total execution time. Communication latency is a critical determinant of the 
amount of exploitable parallelism and the cost of synchronization. For shared-memory algorithms, 
latency affects the speed of cache-replacement and coherency operations. In message-passing 
programs, latency affects the delay between message transmission and reception. In dataflow 
programs, latency determines the delay between the computation of a data value and the time when 
the value can actually be used. Data parallel operations are limited by the rate at which processors 
can obtain access to the data on which they need to operate. 

Multithreaded ([Smi78] [Jor83] [ALKK90] [SBCvE90] [CSS+91] [NPA92]) and dataflow 
([ACM88] [AI87] [PC90]) architectures have been developed to mitigate communication latency 
by hiding its effects. These techniques all rely on an abundance of parallelism to provide useful 
processing to perform while waiting on slow communications. The limit to the usable parallelism 
is then determined by the nature of the problem and the algorithm used to solve it, the rate of 
computation on each processor, and the communication latency. Our challenge today is to provide 
sufficiently low-latency communications to match the computation rate provided by commodity 
processors while allowing the most effective use of the parallelism inherent in each problem. 

Regardless of the exact network topology used for communications, both the number of switching
components and the amount of wiring inside the network are at least linear in the number of 
processors supported by the network. The single component failure rate is also linear in the network 
size. If we do not engineer the network to operate properly when faults exist, the acceptable failure 
rate for any system will directly fix a ceiling on the maximum machine size. To avoid this ceiling 
we consider network designs which can operate properly in the presence of faults. 
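
The ceiling on machine size can be made concrete with a small calculation. The sketch below is illustrative only: it assumes that components fail independently and with the same probability, which is a simplification introduced here and not the fault model developed later in this report. The function names and the example numbers are hypothetical.

```python
import math


def prob_any_failure(num_components: int, p_fail: float) -> float:
    """Probability that at least one of num_components components has failed.

    Assumes independent failures with identical probability p_fail
    (an assumption of this sketch).
    """
    return 1.0 - (1.0 - p_fail) ** num_components


def max_components(p_fail: float, acceptable: float) -> int:
    """Largest component count whose system failure probability stays
    at or below `acceptable` when no fault can be tolerated."""
    return math.floor(math.log(1.0 - acceptable) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    # Hypothetical numbers: each component fails with probability 0.001
    # over some interval, and we accept a 5% chance of system failure.
    limit = max_components(0.001, 0.05)
    print("largest fault-intolerant system:", limit, "components")
    print("failure probability at that size:", prob_any_failure(limit, 0.001))
```

Under these assumptions the allowable component count is fixed by the acceptable failure rate alone, which is the ceiling that a fault-tolerant network design avoids.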

In this document, we examine a class of processor interconnection networks which are designed 
to simultaneously minimize network latency while maximizing fault tolerance. A combination of 
organizational techniques, protocols, circuit techniques, and packaging technologies are employed 
to realize a class of integrated solutions to these problems. 


1.1 Goals 


Our goals in designing a high-performance network for large-scale multiprocessing are to 
optimize for: 

• Low Latency 

• High Bandwidth 

• High Reliability 

• Testability/Repairability 

• Scalability 

• Flexibility/Versatility 

• Reasonable Cost 

• Practical Implementation 

As suggested above and developed further in Sections 2.3 and 2.5, latency and reliability are key 
properties which must be considered when designing a large-scale, high-performance multiprocessor
network. Insufficient bandwidth will have a detrimental impact on latency (Section 2.4). 
Fault diagnosis and repair are key to limiting the impact of any faults in the network (Section 2.6). 
Scalability of the solution is important to maximize the period over which the solutions remain 
effective. Flexibility in the solutions allows the class of networks to remain applicable across a wide 
range of specific needs (Section 2.8). 

1.2 Scope 

This work only attempts to address issues directly related to the network for a large-scale 
multiprocessor. Attention is paid to providing efficient and robust interfaces between processing 
nodes and the network. Attention is also given to how the node interacts with the network. However, 
the fault-tolerance schemes presented here do not guard against failures of the processing nodes or 
in the memory system. The scheme detailed here may be suitable for a reliable network substrate 
for future work in processor and memory fault recovery. 

1.3 Overview 

In this section, we provide a quick overview of the network design at several levels. This section 
should give the reader a basic picture of the class of networks and technologies being considered. 
Part II develops everything introduced here in detail. 


A multibutterfly style interconnection network constructed from 4x2 (inputs x radix) 
dilation-2 crossbars and 2x2 dilation-1 crossbars. Each of the 16 endpoints has two inputs 
and outputs for fault tolerance. Similarly, the routers each have two outputs in each of 
their two logical output directions. As a result, there are many paths between each pair of 
network endpoints. Paths between endpoint 6 and endpoint 16 are shown in bold. 


Figure 1.1: 16 X 16 Multibutterfly Network 


1.3.1 Topology 

A suitable network topology is the first essential ingredient to producing a reliable, high- 
performance network. The network topology will ultimately dictate: 

• Switching Latency - the number of switches, and to some extent the length of the wires, 
which must be traversed between nodes in the network 

• Underlying Reliability - the redundancy available to make fault-tolerant operation possible 

• Scalability - the characteristic growth of resource requirements with system size 

• Versatility - the extent to which the network can be adapted to a wide-range of applications. 

To simultaneously optimize these characteristics, we utilize multipath, multistage interconnection
networks based on several key ideas from the theoretical community including multibutterflies 
[Upf89] [LM92] and fat trees [Lei85]. 

Figure 1.2: Area-Universal Fat-Tree with Constant Size Switches (Greenberg and Leiserson) 


Using multibutterfly (see Figure 1.1) and fat-tree networks (see Figure 1.2), we minimize the 
number of routing switches which must be traversed in the network between any pair of nodes. 
Using bounded degree routing nodes, the least possible number of switches between endpoints 
is logarithmic in the size of the network, a lower bound which these networks achieve. For 
small machine configurations the multibutterfly networks achieve the logarithmic lower bound 
with a multiplicative constant of one (e.g. routing switches traversed = log_r N, where N is 
the number of processing nodes in the network and r is the radix of the routing component 
used for switching). For larger machine configurations, fat trees provide lower latency for local 
communication. Applications can take advantage of the locality inherent in the fat-tree topology to 
realize lower average communication latencies. To further minimize switching latency, our fat-tree 
networks make use of short-cut paths, keeping the worst-case switching latency down to | log 4 N 
when using radix-four routing components. 
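
The logarithmic switch count quoted above for multibutterfly networks can be checked with a few lines of arithmetic. This is a minimal sketch: the helper name `min_stages` is hypothetical, and it computes only the ceiling of log_r N for a flat multistage network, not the fat-tree short-cut bound.

```python
def min_stages(num_nodes: int, radix: int) -> int:
    """Smallest number of radix-`radix` routing stages needed to
    distinguish `num_nodes` destinations, i.e. ceil(log_radix(num_nodes)).

    Integer arithmetic is used to avoid floating-point rounding
    near exact powers of the radix.
    """
    if num_nodes < 1 or radix < 2:
        raise ValueError("need num_nodes >= 1 and radix >= 2")
    stages = 0
    reachable = 1
    while reachable < num_nodes:
        reachable *= radix
        stages += 1
    return stages


if __name__ == "__main__":
    for n, r in [(16, 2), (64, 4), (1024, 4)]:
        print(f"N={n}, r={r}: {min_stages(n, r)} switches traversed")
```

Raising the radix of the routing component shortens the path, at the cost of a larger switch at every stage.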

The multipath nature of these routing networks provides a basis for fault-tolerant operation, as 
well as providing high bandwidth operation. The multipath networks provide multiple, redundant 
paths between every pair of processing nodes. The alternative paths are also available for minimizing
congestion within the network, resulting in increased effective bandwidth and decreased 
effective latency. When faults occur, the availability of alternative paths between endpoints makes 
it possible to route around faulty components in the network. 

A high-degree of scalability is achieved by using fat-tree organizations for large networks. The 
scalable properties of fat trees allow construction of arbitrarily large machines using the same basic 
network architecture. When organized properly, these large fat trees can be shown to minimize 
the total length of time that any message spends traversing wires within the routing network as 
compared to any other network. The hardware resources required for the fat-tree network grow 
linearly in the number of processors supported. 

Further, these networks provide considerable versatility allowing them to be adapted to meet 
the specific needs of a particular application. By selecting the number of network ports into each 
processing node, we can customize the bandwidth and reliability within the network to meet the 
needs of the application. By controlling the width of the basic data channel, we can provide 
varying amounts of latency and bandwidth into a node. This flexibility makes it possible to use 
the same basic network solutions across a broad range of machines from low-cost workstations to 
high-bandwidth supercomputers by selecting the network parameters appropriately. 
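
To illustrate how channel width trades against latency, the sketch below uses a deliberately simple timing model that is an assumption of this example and not a result from the report: a message is serialized over a channel of fixed width, and each routing stage adds a fixed number of pipeline cycles. The function name and the parameter values are hypothetical.

```python
def transfer_cycles(message_bits: int, channel_width: int,
                    stages: int, cycles_per_stage: int = 1) -> int:
    """Cycles to deliver a message under an assumed pipelined model:
    per-stage pipeline delay plus the time to serialize the message
    over a channel that carries `channel_width` bits per cycle."""
    if channel_width <= 0:
        raise ValueError("channel_width must be positive")
    # Ceiling division: cycles needed to push all bits through the channel.
    serialization = -(-message_bits // channel_width)
    return stages * cycles_per_stage + serialization


if __name__ == "__main__":
    # Hypothetical 64-bit message crossing 4 routing stages.
    for width in (4, 8, 16):
        print(f"width={width}: {transfer_cycles(64, width, stages=4)} cycles")
```

In this model, doubling the channel width halves the serialization term but leaves the per-stage term unchanged, so wider channels help most for long messages.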

1.3.2 Routing 

While a good network topology is necessary for reliable, high-performance communications, 
it is by no means sufficient. We must also have a routing scheme capable of efficiently exploiting 
the features of the network. In developing a routing strategy for use with multiprocessor communications
networks, we focussed on achieving a routing framework with the following properties: 

1. Low-overhead routing - Low-overhead routing attempts to minimize the fraction of potential
bandwidth consumed by protocol overhead and similarly minimize the latency associated 
with protocol processing. 

2. Fault identification and localization with minimal overhead - To achieve fault tolerance, 
we must be able to detect when faults corrupt data in our system. Further, to minimize the 
impact of faults on system performance, we must be able to efficiently identify the source of 
any faults in the system. 

3. Flexible protocol - To be suitable for use in a wide range of applications and environments, 
the protocol must be flexible, allowing efficient layering of the required data transfer on top 
of the underlying communications. 

4. Dynamic fault tolerance - For the network to scale robustly to very large implementations, 
it is critical that the network and routing components continue to operate properly as new 
faults arise in the system. 

5. Distributed routing - In order to avoid single points of failure in the system, routing must 
proceed in a distributed fashion, requiring the correct operation of no central resources. 

To this end, we have developed the Metro Routing Protocol, MRP, a simple, reliable,
source-responsible router protocol suitable for use with multipath networks. MRP provides half-duplex,
bidirectional data transmission over pipelined, circuit-switched routing channels. The simple
protocol coupled with pipelined routing allows for high-bandwidth, low-latency implementations. The
circuit-switched nature avoids the issues associated with buffering inside the network. Each routing
component makes local routing decisions among equivalent outputs based on channel utilization,
using randomization to choose among equivalent alternatives. Routing components further provide
connection information and checksums back to the source node to allow error localization within
the network. When errors or blocking occur, the source can retry data transmission. The
randomization in path selection guarantees that any existing non-faulty path can eventually be found
without global information.
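
The retry behaviour described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the MRP implementation: the network layout, the names `try_route` and `send_with_retry`, and the `max_tries` cutoff are all hypothetical, and channel utilization and checksums are not modelled. Each router picks at random among its equivalent outputs, and the source simply retries when a connection attempt hits a faulty link.

```python
import random

def try_route(links, start, rng):
    """One connection attempt through a multistage network.

    links[s][r] lists the equivalent outputs of router r in stage s as
    (next_router, healthy) pairs. The layout is an illustrative assumption.
    """
    router = start
    for stage in links:
        next_router, healthy = rng.choice(stage[router])  # random choice among equivalents
        if not healthy:
            return False  # attempt fails; the source learns of it and may retry
        router = next_router
    return True

def send_with_retry(links, start, rng, max_tries=1000):
    """Retry from the source until an attempt succeeds; return the attempt count."""
    for attempt in range(1, max_tries + 1):
        if try_route(links, start, rng):
            return attempt
    return None  # no working path was found within max_tries

# Hypothetical two-stage network: two routers per stage, two equivalent outputs each.
links = [
    [[(0, True), (1, False)], [(0, True), (1, True)]],
    [[(0, True), (1, True)], [(0, False), (1, False)]],
]
attempts = send_with_retry(links, start=0, rng=random.Random(1))
print("connection established on attempt", attempts)
```

In this toy network half of the attempts from router 0 fail, so the expected number of attempts is two. No global fault map is consulted at any point; only the source retries.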







Figure 1.3: Cross-Section of Stack Packaging (Diagram courtesy of Fred Drenckhahn) 

1.3.3 Technology 

Regardless of the advances we make in topology and routing, the ultimate performance of an 
implementation is limited by the implementation technology. Packaging density constrains the 
minimum lengths for interconnect and hence the minimum latency between routing components 
and nodes. Once our interconnection distances are fixed, data transmission latency is limited by 
the time taken to traverse the interconnect and to traverse component i/o pads. 

Packaging 

Our goal in packaging these networks is to minimize the interconnection distances between 
components. At the same time, we aim to utilize economical technologies and provide efficient 
cooling and repair of densely packaged components. The basic packaging unit is a three-dimensional
stack of components and printed-circuit boards (see Figure 1.3). Computational, memory, and
routing components are housed in dual-sided land-grid arrays and sandwiched between layers of 
conventional PCBs. The land-grid arrays, with pads on both sides of the package, serve to both 
house VLSI components and provide vertical interconnect in the stack structure. Button boards 
are used to provide reliable, solderless connection between land-grid array packages and adjacent 
PCBs. The land-grid array and button board packages provide channels for coolant flow. The 
composite stack structure is compatible with both air and liquid cooling. The stack structure 
provides the necessary dense interconnection in all three physical dimensions allowing for minimal 
wiring distances between components. Using this technology, we can package an entire 64-node 
multiprocessor including the network and nodes in roughly 1' x 1' x 5". 

Signalling 

To minimize wire transit and component i/o time, we utilize series-terminated, matched- 
impedance, point-to-point transmission line signalling. Further, to reduce power consumption 
the i/o structures use low-voltage signal swings. By integrating a series-terminated transmission 
line driver into the i/o pads, we avoid the need to wait for reflections to settle on the PCB traces 
without requiring additional external components. The low-voltage, series-terminated drivers can 
switch much faster than conventional 5V-swing drivers. Initial experience with this technology 
indicates we can drive a signal through an output pad, across 30 cm of wire, and into an input pad 
in less than 5 ns. 

1.3.4 Fault Management 

Performance in the presence of faulty components and wires can be further improved by hiding 
the effects of faulty components. Using some novel, fault-tolerant additions to baseline IEEE 
1149.1-1990 JTAG scan functionality, we can realize an effective scan-based testing strategy. By 
configuring components with multiple test-access ports, the architecture is resilient to faults in the 
test system itself. With port-by-port deselection and scan capabilities, it is possible to diagnose
potentially faulty network components online, i.e., while the rest of the system remains fully
operational. Furthermore, these facilities allow faulty wires and components to be configured out 
of the system so that they do not degrade system performance. Once localized using boundary 
scan, the system can log faulty components for later repair and make an accurate assessment of the 
system integrity. For larger systems, these facilities allow online replacement of faulty subsystems. 

1.4 Organization 

Before developing strategies for addressing these problems, Chapter 2 develops the problems
and issues in further detail. Part II takes a detailed look at the key components of robust, low-latency
networks. Chapter 3 leads off by examining the network topology. Chapter 4 addresses the issue
of low-latency, high-speed, reliable routing on the networks introduced in Chapter 3. Chapter 5
considers fault identification and system reconfiguration. Chapter 6 develops suitable, high-speed
signalling techniques compatible with the router-to-router communications required by the
routing protocol. Finally, Chapter 7 looks at packaging technologies for practical, high-performance
networks. Part III contains a brief series of case studies from our experience designing and building
reliable, low-latency networks. Chapter 8 reviews the RN1 routing component. Chapter 9 discusses
RN1’s successor, the METRO router series. Chapter 11 describes METRO-LINK, a network interface
suitable for connecting a processing node into a METRO-based network. Finally, Chapters 10 and 12
discuss MBTA, an experimental multiprocessor which puts most of the technology described in
Part II and the components detailed in Part III together in a complete multiprocessor system.
Chapter 13 concludes by reviewing the techniques introduced in Part II and showing how they
come together to achieve low-latency and fault-tolerant operation.





2. Background 


This chapter provides background material to prepare the reader for the development in Parts II 
and III. Section 2.1 describes the fault model and multiprocessor model assumed throughout this
document. Section 2.2 provides a brief review of standard scan-based testing practices. Sections 2.3
and 2.5 point out the importance of low latency and fault tolerance to large-scale multiprocessor
systems. Section 2.4 reviews the composition of network latency. Section 2.6 looks at the 
requirements for fault tolerance. Finally, Sections 2.7 and 2.8 introduce several other key issues in 
the practical design of interconnection networks. 

2.1 Models 

2.1.1 Fault Model 

Faults occurring in a network may be either static or dynamic and may be transient faults or 
permanent faults. While a permanent fault occurs and remains a fault, a transient fault may only 
persist for a short period of time. Transient faults which recur with notable frequency are termed 
intermittent. [SS92] indicate that transient and intermittent faults account for the vast majority of 
faults which occur in computer systems. For the purposes of this presentation, static faults are
permanent or intermittent faults which have occurred at some point in the past and are known to 
the system as a whole. Dynamic faults are transient faults or any faults which the system has not 
yet detected. 

Throughout this work, we assume that faults manifest themselves as: 

1. Stuck-Values - a data or control line appears to be held exclusively high or low

2. Random bit flips - a data or control line has some incorrect, but random value 

Faults may appear and disappear at any point in time. They may become permanent and remain 
in the system, they may be transient and disappear, or they may be intermittent and recurring. 
Stuck-value errors may take on an arbitrary, but constant, logic value. Bit flips are assumed to take 
on random values. Specifically, we are not assuming an adversarial fault model (e.g. [MR91]) in 
which faulty portions of the system are allowed to take on arbitrary erroneous values. 

These fault-manifestations are chosen to be consistent with fault expectations in digital hardware 
systems. Structural faults in the interconnect between components may give rise to floating or 
shorted nodes. With proper electrical design, floating i/o’s can appear as stuck-values to internal 
logic. The value seen on shorted nodes will depend on the values driven onto them and may appear as
random bit flips when the values differ. Clocking, timing, and noise problems which cause incorrect
data to be sampled by a component will also appear as random bit errors. Opens and bridging faults
within an IC may also leave nodes shorted or floating. For a good survey of physical faults and
their manifestations see Chapter 2 in [SS92].






The manner in which we handle dynamic faults in this work relies on end-to-end checksums to 
make the likelihood that a corrupted message looks like a good message arbitrarily small. As long 
as faults produce random data, we can select a checksum which has the desired property. However, 
if we allow arbitrary, malicious intervention as in an adversarial fault model, the adversary could 
remove a corrupted message from the network and replace it with one which looks good or remove 
a good message from the network and fake an acknowledgment. In order to handle this stronger 
fault-model, one would have to replace our practice of guarding data with checksums with an 
end-to-end data encryption scheme. A properly chosen encryption scheme could make the chances 
that an adversary could fake any message sufficiently remote for any particular application.
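
As an illustration of guarding data with an end-to-end checksum, the sketch below appends a CRC-32 to a message and checks it at the receiver. The choice of CRC-32 and the helper names `guard`, `check`, and `flip_bit` are assumptions made for this example; the text above does not specify a particular checksum. For random corruption, a well-chosen k-bit checksum lets a bad message pass with probability on the order of 2^-k, so a wider checksum makes that likelihood smaller.

```python
import zlib

def guard(payload: bytes) -> bytes:
    """Append a 32-bit CRC to the payload (CRC-32 is an illustrative choice)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check(message: bytes) -> bool:
    """Return True if the trailing CRC matches the payload."""
    payload, tag = message[:-4], message[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == tag

def flip_bit(message: bytes, bit: int) -> bytes:
    """Model a single random bit flip on the wire."""
    corrupted = bytearray(message)
    corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)

msg = guard(b"hello network")
print(check(msg))               # intact message is accepted
print(check(flip_bit(msg, 5)))  # single-bit corruption is rejected
```

A checksum of this kind only addresses the random fault model assumed here; as noted above, it offers no protection against an adversary who can recompute the checksum.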

For the sake of the presentation here, we limit our concern to faults within the network itself. 
The processing nodes are presumed to function correctly, if at all. A processing node may cease to 
function, but it may not provide erroneous data to the network. All network transactions requested 
by the node are presumed to be intentional. The computational implications of losing access to an 
ongoing computation or the memory stored at a failing node are important but beyond the scope of 
this work. 

Without knowing the reliability design of the computational system as a whole, it is not clear 
whether a fault-tolerant network should be designed to optimize for harvest or yield. Yield is the 
term used to describe the likelihood that the system can be used to complete a given task. If we 
require that all nodes be fully connected to the network, then designing the network is a yield 
problem in which the network is only considered good when it provides full connectivity. In this 
case, we want to optimize for the highest yield at the fault levels of interest. Harvest Rate is 
the term used to refer to the fraction of total functional units which are usable in a system. If the
computational model can cope with the node loss, then designing the network is a harvest problem 
in which we attempt to optimize for the most connectivity at any fault level. 
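
The distinction between the two metrics can be made concrete with a short sketch. The function names and the sample data are hypothetical; each entry in `connected_counts` stands for the number of nodes that remain connected in one sampled fault scenario.

```python
def harvest_rate(connected, total):
    """Fraction of functional units that remain usable in one fault scenario."""
    return connected / total

def estimate_yield(connected_counts, total):
    """Fraction of sampled fault scenarios in which every node stays connected."""
    good = sum(1 for n in connected_counts if n == total)
    return good / len(connected_counts)

# Hypothetical outcomes for a 64-node machine under four sampled fault patterns.
samples = [64, 64, 60, 64]
print(estimate_yield(samples, 64))            # yield over the four samples
print([harvest_rate(n, 64) for n in samples])  # harvest rate in each sample
```

Under the yield view the third scenario counts as a complete failure, while under the harvest view it still delivers most of the machine.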

2.1.2 Multiprocessor Model 

For the purpose of discussion, we assume a homogeneous, distributed memory, multiprocessor
model as shown in Figure 2.1. Each node is composed of a processor, some memory, and a network
interface. In a hardware-supported shared-memory machine, this network interface might be the
cache-controller [LLG+91] [ACD+91]; in a message-passing machine, it would be the network
message interface [Cor91] [Thi91]. Increasingly, the network interface may be tightly-integrated
with the processor [D+92] [NPA91]. We explicitly assume the network interface has multiple
connections both into the network and out of the network. Multiple connections are necessary 
to avoid having a potential single point of failure at the connection between each node and the 
network. 

2.2 IEEE-1149.1-1990 TAP 

In Part II, we introduce extensions to standard, scan-based testing practices to make them 
suitable for use in large-scale systems. This section reviews the major points of the existing 
standard upon which we are building. 

The IEEE Standard Test-Access Port (TAP) [Com90] defines a serial test interface requiring 
four dedicated I/O pins on each component. The standard allows components to be daisy-chained 






Figure 2.1: Multiprocessor Model 


[Diagram: TAP Controller (TMS, TCK inputs), Instruction Register, Instruction Decode, Boundary Register, Scan Register, Bypass Register, and output Mux driving TDO]


Figure 2.2: Standard IEEE TAP and Scan Architecture 


so that a single test path can provide access to many or all components in a system. The standard 
provides facilities for external boundary-scan testing, internal component functional testing, and 
internal scan testing. Additionally, the TAP provides access to component-specific testing and 
configuration facilities. Figure 2.2 shows the basic architecture for an IEEE scan-based TAP. 

In a system in which all components comply with the standard, boundary-scan testing allows 
complete structural testing. Using the serial scan path, every I/O pin in the system can be configured
to drive a logic value or act as a receiver. Using the same serial scan path, the value of every receiver
can be sampled and recovered. This mechanism allows the TAP to verify the complete connectivity 
of the components in the system. All connectivity faults, shorted wires, stuck drivers or receivers, 
or open-circuits can be identified in this manner [GM82] [Wag87].
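
The idea of driving values and sampling receivers can be sketched in a few lines. This is a deliberately simplified illustration, not the IEEE 1149.1 test procedure: wires are modelled as named entries, only stuck-value faults are considered, and the helper names are hypothetical. Detecting shorts between wires would need additional patterns (for example, driving one wire at a time).

```python
def sample_receivers(driven, stuck):
    """Value seen at each receiver: the driven value unless the wire is stuck."""
    return {wire: stuck.get(wire, value) for wire, value in driven.items()}

def find_stuck_wires(wires, stuck):
    """Drive all-0 then all-1 via the scan path; report wires that fail to follow."""
    faulty = {}
    for level in (0, 1):
        seen = sample_receivers({w: level for w in wires}, stuck)
        for w in wires:
            if seen[w] != level:
                faulty[w] = seen[w]  # the wire is stuck at the value observed
    return faulty

# Hypothetical board with three wires, one of which is stuck high.
print(find_stuck_wires(["a", "b", "c"], {"b": 1}))
```

Two patterns suffice here because a stuck wire must disagree with at least one of the two driven levels.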

The scan path allows data to be driven into a component independent of the values present on 
the component’s external I/O pins. The resultant values generated by the component in response 
to the driven data can similarly be sampled and recovered via the serial scan path. This facility 
permits functional, in-circuit verification of any such component. 

The standard allows additional instructions which may function in a component-specific manner. 
These instructions provide uniform access to internal-component scan-paths. Such internal paths 
are commonly used to allow a small number of test-patterns to achieve high-fault coverage in 
components with significant internal state. Other common additions are configuration registers and 
Built-In-Self-Test (BIST) facilities [KMZ79] [LeB84] [Lak86]. 


2.3 Effects of Latency 1 


For the sake of understanding the role of latency in multiprocessor communications, we consider 
a very simple model of parallel computation. To solve our problem we need to execute a total 
number of operations, c. Let us assume our problem is characterized by a constant amount of 
parallelism, p. During each clock cycle, we can perform p operations. Parallelism is limited 
because each set of p operations depends on the results of the previous p operations. After a set 
of operations complete, they must communicate their results with the processors which need those 
results for the next set of p operations. Let us assume that communicating between processors 
requires l clock cycles of latency.

If we executed our program on a multiprocessor with more than p nodes, it would take
$T_{multiproc}$ cycles to solve the problem.

$$T_{multiproc} = \frac{c \cdot (l + 1)}{p} \qquad (2.1)$$

At clock cycle 1, we can execute p operations on the nodes. We then require l cycles to communicate
the results. The next p operations can then be executed in cycle l + 2. Computation continues
in this manner, executing p operations every (l + 1) cycles. Thus p/(l + 1) operations are executed, on
average, each cycle, giving us Equation 2.1.
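
A quick numeric check of Equation 2.1 shows how strongly latency affects run time. The function name and the sample values of c, p, and l below are illustrative assumptions.

```python
def multiproc_cycles(c, p, l):
    """Cycles to execute c operations with parallelism p and latency l (Equation 2.1)."""
    return c * (l + 1) / p

# Hypothetical problem: one million operations, parallelism of 100.
for latency in (0, 1, 9):
    print(latency, multiproc_cycles(1_000_000, 100, latency))
```

With no communication latency the problem takes 10,000 cycles; with l = 9 it takes 100,000 cycles, so the hundredfold parallelism yields only a tenfold speedup over serial execution.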

We see immediately that the exploitable parallelism is limited by the latency of communication. 
If our problem allows much more parallelism than we have nodes in our multiprocessor, we 
can hide the effects of latency by performing other operations in a set while waiting for the 
communication associated with the earlier operations to complete. However, if we wish to use
large-scale multiprocessors to solve big problems, latency directly acts to limit the extent to which 
we can exploit parallel execution to solve our problem quickly. 

In most parallel programs, the number of operations which can be executed in parallel varies
throughout the program’s execution. Hence p is not a constant. Researchers have characterized
this parallelism for particular programs and computational models using a parallelism profile which
shows the number of operations which may be executed simultaneously at each time-step assuming
an unbounded number of processors (e.g., [AI87]). The available parallelism will be a function of
the compiler and run-time system in addition to being dependent on the problem being solved and
the algorithm used to solve it.

1 The basic argument presented here is drawn from an unpublished manuscript by Professor Michael Dertouzos.

Communication latency is also, generally, not constant. Section 2.4 looks at the factors that 
affect latency in a multiprocessor network. 

Despite the fact that our model used above is overly simplistic, it does give us insight into
the role which latency plays in parallel computing. When our algorithm, compiler, and run-time 
system can discover much more parallelism than we have processing elements to support, with 
good engineering we can hide some or all of the effects of latency. On the other hand, when we are 
unable to find such a surplus of parallelism, latency further derates the exploitable parallelism in a 
linear fashion. 

2.4 Latency Issues 

In this section, we consider in further detail many of the issues relevant to achieving low-latency 
communications. 

2.4.1 Network Latency 

Ignoring protocol overhead at the destination or receiving ends of a network, the latency in an 
interconnection network comes from four basic factors: 

1. Transit Latency ($T_t$): The amount of time the message spends traversing the interconnection
media within the network

2. Switching Latency ($T_s$): The amount of time the message spends being switched or routed
by switching elements inside the network

3. Transmission Time ($T_{transmit}$): The time required to transmit the entire contents of a message
into or out-of the network

4. Contention Latency ($\Upsilon$): The degradation in network latency due to resource contention in
the network

Transit latency is generally dictated by physics and geometry. Transit latency is the quotient of
the physical distance, d, and the rate of signal propagation, v.

$$T_t = \frac{d}{v} \qquad (2.2)$$

Basic physics limits the amount of time that it takes for a signal to traverse a given distance. Materials
will affect the actual rate of signal propagation, but regardless of the material, the propagation speed
will always be below the speed of light, $c \approx 3 \times 10^{10}$ cm/s. The rate of propagation is given by:

$$v = \frac{1}{\sqrt{\epsilon \mu}} \qquad (2.3)$$

For most materials $\mu \approx \mu_0$, where $\mu_0$ is the permeability of free space. Conventional printed-circuit
boards (PCBs) have $\epsilon = \epsilon_r \epsilon_0$, where $\epsilon_r \approx 4$ and $\epsilon_0$ is the permittivity of free space; thus,
$v \approx c/2$. High performance substrates have lower values for $\epsilon_r$. The physical geometry for the
network determines the physical interconnection distances, d. Physical geometry is partially in the
domain of packaging (Chapter 7), but is also determined by the network topology (Chapter 3). All
networks are limited to exploiting, at most, three-dimensional space. Even in the best case, the
total transit distance between two nodes in a network is at least limited by the physical distance
between them in three-space. Additionally, since physical interconnection channels (e.g., wires,
PCB traces, silicon) occupy physical space, the volume these channels consume within the network
often affects the physical space into which the network and nodes may be packed.
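
To give a feel for the magnitudes involved, the sketch below evaluates the propagation relation above for a conventional PCB. The function names are hypothetical, the relative permeability is assumed to be 1, and the 30 cm distance is chosen only as a convenient board-scale example.

```python
import math

C_CM_PER_S = 3e10  # speed of light in cm/s, as quoted in the text

def propagation_speed(eps_r, mu_r=1.0):
    """Signal speed in cm/s for relative permittivity eps_r (mu_r assumed to be 1)."""
    return C_CM_PER_S / math.sqrt(eps_r * mu_r)

def transit_latency_ns(distance_cm, eps_r):
    """Transit latency T_t = d / v, returned in nanoseconds."""
    return distance_cm / propagation_speed(eps_r) * 1e9

print(propagation_speed(4.0))         # about half the speed of light
print(transit_latency_ns(30.0, 4.0))  # roughly 2 ns for 30 cm of PCB trace
```

On this substrate every 15 cm of wire costs about one nanosecond, which is why packaging density bears directly on latency.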

For networks with uniform switching nodes, switching latency is the product of the number of
switching stages between endpoints, $s_n$, and the latency of each switching node, $t_{nl}$.

$$T_s = s_n \cdot t_{nl} \qquad (2.4)$$

The network topology dictates the number of switching stages. The latency of each switching node
is the sum of the signal i/o latency, $t_{io}$, and the switching node functional latency, $t_{switch}$.

$$t_{nl} = t_{io} + t_{switch} \qquad (2.5)$$

The signal i/o latency, or the amount of time required to move signals into and out-of the switching 
node, is generally determined by the signalling discipline and the technologies used for the switching 
node (Chapter 6). The switch functional latency accounts for the time required to arbitrate for an 
appropriate output channel and move message data from the input channel to the output channel. In 
addition to technology, the switch functional latency will depend on the complexity of the routing 
and arbitration schemes and the complexity of the switching function (Chapter 4). Larger switches 
generally require more complicated arbitration and switching, resulting in larger inherent switching 
latencies. 

The transmission time accounts for the amount of time required to move the entire message 
data into or out-of the network. In many networks, the amount of data transmitted in a message is 
larger than the width of a data channel. In these cases, the data is generally transmitted as a sequence
of pieces, where each piece is limited to the width of the channel. Assuming we have a message of
length L to send over a channel w bits wide which can accept new data every $t_c$ time units, we have
the transmission time, $T_{transmit}$, given by:

$$T_{transmit} = \left\lceil \frac{L}{w} \right\rceil \cdot t_c \qquad (2.6)$$

Here we see one of the places where low bandwidth has a detrimental effect on network latency.
$T_{transmit}$ increases as the channel bandwidth decreases.
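
The effect of channel width on transmission time can be checked numerically. This sketch assumes the message is sent as whole channel-width pieces, so a partial final piece still occupies a full transfer; the function name and sample values are illustrative.

```python
def transmit_time(length_bits, width_bits, t_c):
    """Time to move a message of length_bits over a width_bits channel,
    one piece every t_c time units (a partial last piece takes a full slot)."""
    pieces = -(-length_bits // width_bits)  # integer ceiling division
    return pieces * t_c

# Hypothetical 160-bit message with a 10 ns channel cycle.
print(transmit_time(160, 8, 10))  # 8-bit channel
print(transmit_time(160, 4, 10))  # halving the width doubles the time
```

Halving the channel width, or doubling the cycle time, doubles the transmission time for the same message.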

Contention latency arises when resource conflicts occur and a message must wait until the
necessary resources are available before it can be delivered to its designated destination. Such conflicts
result when the network has insufficient bandwidth or the network bandwidth is inefficiently used.
In packet-switched networks, contention latency manifests itself in the form of queuing which must
occur within switches when output channels are blocked. In circuit-switched networks, contention
latency is incurred when multiple messages require the same channel(s) in the network and some
messages must wait for others to complete. Contention latency is the effect which differentiates
an architecture’s theoretical minimum latency from its realized latency. The amount of contention
latency is highly dependent on the manner in which an application utilizes the network. Contention
latency is also affected by the routing protocol (Chapter 4) and network organization (Chapter 3).
One can think of contention latency as a derating factor on the unloaded network latency.


$$T_{unloaded} = T_s + T_t \qquad (2.7)$$

$$T_{net} = \Upsilon(\mathit{application}, \mathit{topology}) \cdot T_{unloaded} + T_{transmit} \qquad (2.8)$$

One of the easiest ways to see this derating effect is when an application requires more bandwidth 
between two sets of processors than the network topology provides. In such a case, the effective 
latency will be increased by a factor equal to the ratio of the desired application bandwidth to the 
available network bandwidth, e.g., if $A_{bw}$ is the bandwidth needed by an application, and $N_{bw}$ is
the bandwidth provided by the network for the required communication, we have:

$$\Upsilon = \frac{A_{bw}}{N_{bw}}$$

In practice, the derating factor is generally larger than a simple ratio due to the fact that the resource 
conflicts themselves may consume bandwidth. For example, on most local-area networks, when 
contention results in collisions, the time lost during the collision adds to the network latency as 
well as the time to finally transmit the message. 
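
Putting Equations 2.7 and 2.8 together with the simple bandwidth-ratio estimate gives the sketch below. All names and numbers are illustrative assumptions. In particular, clamping the derating factor to at least 1 when the network has spare bandwidth is an assumption of this sketch, and, as noted above, real contention is generally worse than the simple ratio.

```python
def contention_factor(a_bw, n_bw):
    """Simple derating estimate: ratio of needed to available bandwidth.
    Clamped to 1 when the network has spare bandwidth (an assumption here)."""
    return max(1.0, a_bw / n_bw)

def network_latency(s_n, t_io, t_switch, d, v, derate, t_transmit):
    """T_net = derate * (T_s + T_t) + T_transmit, following Equations 2.7 and 2.8."""
    t_s = s_n * (t_io + t_switch)  # switching latency
    t_t = d / v                    # transit latency
    return derate * (t_s + t_t) + t_transmit

# Hypothetical network: 3 stages, 5 ns i/o and 5 ns switching per stage,
# 30 cm of wire at 15 cm/ns, 200 ns transmission time, application needs 2x the bandwidth.
derate = contention_factor(a_bw=2.0, n_bw=1.0)
print(network_latency(3, 5.0, 5.0, 30.0, 15.0, derate, 200.0))
```

With these numbers the unloaded latency is 32 ns and contention doubles it to 64 ns before the 200 ns transmission time is added.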

The effects of contention latency make it clear why a bus is inefficient for multiprocessor 
operation. The bus provides a fixed bandwidth, $N_{bw}$. There is no switching latency and generally
a small transit latency over the bus. However, as we add processors to the bus, the bandwidth
potentially usable by the application, $A_{bw}$, generally increases while the network bandwidth stays
fixed. This translates into a large contention derating factor, $\Upsilon$, and consequently high network
latency. 

Unfortunately, it is hard to quantify the contention latency factor as cleanly as we can quantify 
other network latency factors. The bandwidth required between any pair of processors is highly 
dependent on the application, the computational model in use, and the run-time system. Further, it 
depends not just on the available bandwidth between a pair of processors, but between any sets of 
processors which may wish to communicate simultaneously. 

2.4.2 Locality 

Often physical and logical locality within a network can be exploited to minimize the average 
communication latency. In many networks, nodes are not equidistant. The transit latency and 
switching latency between a pair of nodes may vary greatly based on the choice of the pair of 
nodes. Logical distance is used to refer to the amount of switching required between two nodes
($T_s$), and physical distance is used to refer to the transit latency ($T_t$) required between two nodes. Thus,
two nodes which are closer, or more local, to each other logically and physically may communicate 
with lower latency than two nodes which are further apart. Additionally, when logically close nodes 
communicate they use less switching resources and hence contribute less to resource contention in 
the network. 





The extent to which locality can be exploited is highly dependent upon the application being run 
over the network. The exploitation of network locality to minimize the effective communication 
latency in a multiprocessor system is an active area of current research [KLS90] [ACD+91] [Wal92].
Exploiting network locality is of particular interest when designing scalable computer systems since
the latency of the interconnect will necessarily increase with network size. Assuming the physical 
and logical composition of the network remains unchanged when the network is grown, for networks 
without locality, the physical distance between all nodes grows as the system grows due to spatial 
constraints. For networks with locality the physical distance between the farthest separated nodes 
grows. Additionally, as long as bounded-degree switches (Section 2.7.1) are used to construct the 
network, the logical distance between nodes increases as well. Locality exploitation is one hope 
for mitigating the effects of this increase in latency. 

It is necessary to keep the benefits due to locality in proper perspective with respect to the entire 
system. A small gain due to locality can often be dwarfed by the fixed overheads associated with 
communication over a multiprocessor network. Locality optimizations yield negligible rewards 
when the transmission latency benefit is small compared to the latency associated with launching and 
handling the message. Johnson demonstrated upper bounds on the benefits of locality exploitation 
using a simple mathematical model [Joh92]. For a specific system ([ACD+91]), he shows that
even for machines as large as 1000 processors, the upper bound on the performance benefit due to 
locality exploitation is a factor of two. 

2.4.3 Node Handling Latency 

This document concentrates on designing the network for a high-performance multiprocessor. 
Nonetheless, it is worthwhile to point out that the effective latency seen by the processors is also 
dependent on the latency associated with getting messages from the computation into the network, 
out-of the network, and back into the computation. Network input latency, $T_p$, is the amount of time
after a processor decides to issue a transaction over the network, before the message can be launched
into the network, assuming network contention does not prevent the transaction from entering the
network. Similarly, network output latency, $T_w$, is the amount of time between the arrival of a
complete message at the destination node and the time the processor may begin actually processing 
the message. If not implemented carefully, large network input and output latency can limit the 
extent to which low-latency networks can facilitate low-latency communication between nodes. 
Combining these effects with our network latency we have the total processor to processor message 
latency: 

$$T_{message} = T_p + T_{net} + T_w \qquad (2.9)$$

This document will not attempt to directly address how one minimizes node input and output 
latency. Node latencies such as these are highly dependent on the programming model, processor, 
controller, and memory system in use. [NPA92] and [D+92] describe processors which were
designed to minimize these latencies. [E+92] and [CSS+91] describe a computational model
intended to minimize these latencies. Here, we will devote some attention to assuring that the 
network itself does not impose limitations which require large node input and output latencies. 





2.5 Faults in Large-Systems 

In this section we review a first-order model of system failure rates. We use this simple model 
to underscore the importance of fault tolerance in large-scale systems. 

For the sake of simplicity, let us begin by considering a simple discrete model of component
failures. A single component fails with some probability, $P_c$, in time T. This gives us a failure rate:

$$\lambda_c = \frac{P_c}{T} \qquad (2.10)$$

The probability that the component survives a period of time T is then:

$$P_{cs} = 1 - P_c \qquad (2.11)$$

If we have a system with N components which fails if any of the individual components fail, then
the system survives a period of time T only if all components survive the period of time T. Thus:

$$P_{ss} = P_{cs}^N = (1 - P_c)^N \qquad (2.12)$$

For any reasonable component and a small time period, T, $P_c \ll 1$. To first order, Equation 2.12
can reasonably be approximated as:

$$P_{ss} \approx 1 - N \cdot P_c \qquad (2.13)$$

This tells us that the probability that the system fails during time T is simply:

$$P_s = N \cdot P_c \qquad (2.14)$$

which corresponds to a failure rate:

$$\lambda_s = \frac{N \cdot P_c}{T} = N \cdot \lambda_c \qquad (2.15)$$

From Equation 2.15 we see that the failure rate increases linearly with the number of components
in the system, to first order.
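
The first-order step in this derivation can be made explicit with the binomial expansion. This is a standard identity, spelled out here as a sketch; note that the approximation needs the product of N and the per-component failure probability to be small, which is a slightly stronger condition than the per-component probability alone being small.

```latex
(1 - P_c)^N \;=\; 1 - N P_c + \binom{N}{2} P_c^2 - \cdots
\;\approx\; 1 - N P_c
\qquad \text{when } N P_c \ll 1 .
```

The condition can always be met by choosing the time period T short enough, since the per-component failure probability shrinks with T.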

Example: A moderate-complexity, modern component has a failure rate of ten failures per million
hours ($\lambda_c \approx 10^{-5}\,\mathrm{hr}^{-1}$) (see [oD86] for estimating component failure rates). A million-component
machine which depended on all million components working correctly would have:

$$\lambda_s = N \cdot \lambda_c = 10^6 \times 10^{-5}\,\mathrm{hr}^{-1} = 10/\mathrm{hr} \qquad (2.16)$$

This gives the machine a Mean Time To Failure (MTTF) of 6 minutes.

If we can relax the requirement that all components and interconnect function correctly in order 
for the system to be operational, we can improve the MTTF. If we can sustain k faults before the 
system is rendered inoperative, the MTTF will be longer. As long as $k \ll N$, we can assume a
constant failure rate for components given by Equation 2.15. Assuming the faults are independent,
the rate of occurrence of k failures is:

$$\lambda_k = \frac{\lambda_s}{k} \qquad (2.17)$$





That is, the MTTF increases linearly with the number of tolerated faults. Revisiting our example, 
if we design the system to sustain 1000 faults (k = 1000), or just 0.1% of the total components in 
the system, the MTTF increases by a factor of 1000 to 6000 minutes or 100 hours. 
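
The arithmetic of this example can be reproduced with a short sketch of the first-order model. The function names are hypothetical; the model assumes independent faults and a number of tolerated faults much smaller than the component count, as in the text.

```python
def system_failure_rate(n_components, component_rate_per_hr):
    """First-order system failure rate: N times the component failure rate."""
    return n_components * component_rate_per_hr

def mttf_hours(n_components, component_rate_per_hr, k=1):
    """Mean time until the k-th fault, assuming independent faults and k << N."""
    return k / system_failure_rate(n_components, component_rate_per_hr)

# One million components, each failing ten times per million hours.
print(mttf_hours(10**6, 1e-5) * 60)      # minutes, when the first fault is fatal
print(mttf_hours(10**6, 1e-5, k=1000))   # hours, when 1000 faults can be sustained
```

The first figure is about 6 minutes and the second about 100 hours, matching the example.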

For sufficiently large systems, we cannot achieve an adequately low system failure rate by 
requiring that every component in the system function properly. Rather, we must design sufficient 
redundancy into our system to achieve the reliability desired. 

2.6 Fault Tolerance 

In order to achieve fault tolerance, we need the ability to detect when faults have occurred and 
the ability to handle faults which have occurred. Typically, one uses redundancy in some form to 
satisfy both of these needs. Redundant data transmitted along with the message can be used to 
identify when portions of a message are damaged. Parity bits and message checksums are common 
examples of redundant data used to identify data corruption. Once faults are detected, we rely on 
redundant network hardware to avoid faulty portions of the network. That is, there must be different 
resources which perform the same function as the faulty portion of the network which can be used 
in place of the faulty portion. We also need a mechanism for exploiting the redundancy. The 
network organization (Chapter 3) often provides the resource redundancy. The routing protocol 
(Chapter 4) provides the redundancy for fault detection and provides mechanisms for exploiting 
the redundancy in the network. 
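
As a minimal illustration of redundant data used for fault detection, the Python sketch below computes a parity bit for one word and appends a simple additive checksum to a message. The additive checksum and the helper names are assumptions made for this example only; they are not the encoding used by the routing protocol of Chapter 4.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word contains an odd number of 1 bits."""
    return bin(word).count("1") % 2


def checksum(words, modulus: int = 256) -> int:
    """Additive checksum of the payload words (illustrative choice)."""
    return sum(words) % modulus


def append_check(words):
    """Return the payload followed by its checksum word."""
    return list(words) + [checksum(words)]


def is_intact(message) -> bool:
    """True when the trailing checksum matches the received payload."""
    *payload, check = message
    return checksum(payload) == check


sent = append_check([0x12, 0x34, 0x56])
damaged = list(sent)
damaged[1] ^= 0x08          # a single bit flipped in transit
```

Here `is_intact(sent)` holds while `is_intact(damaged)` does not, which is the signal a source would use to retransmit, possibly over a different path.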

Designing networks to perform well in the presence of faults is very similar to designing
networks to perform well in the presence of contention. Faults in the network look much like 
contention. Faulty resources are not useful for effectively routing data. In this manner, they have 
the same effect as resources which are always in use. Faulty resources also cause additional traffic 
in the network since they may corrupt messages and hence require the messages to be retransmitted. 
Alternately, we can think of the faulty resources as migrating routing traffic which they would have 
handled to other resources in the network. These non-faulty resources now see more traffic as a 
result. Appropriate design can yield solutions which improve both the performance of the system 
in the face of faults and the performance of the system in the face of heavy traffic. 

2.7 Pragmatic Considerations 

We must also weigh several pragmatic considerations associated with building any system.
When building a system, such as a network, we are constrained by the economics of currently 
available technology, issues of design complexity, and fundamental physical constraints. 

2.7.1 Physical Constraints 

For instance, we have already observed that the speed of signal propagation is largely fixed by 
the speed of light and the dielectric constant of readily available materials. Materials with notably 
lower dielectrics do exist, but the cost and reliability of these materials currently relegates their use 
to small, high-end systems. As technology improves, we can expect these or other materials with 
lower dielectric constants to be available at prices which make their use more worthwhile. 


One might consider using light itself to achieve the maximum transmission rate ($v = c$). In
some situations, this makes the most sense. However, the fact that signals must be converted
from propagating electrons to propagating photons and back again often defeats any potential
gains. The latency associated with converting from electrons to photons and back is currently
large. Even assuming 100% power efficiency, modern optical modulator/detectors have at least an
order of magnitude more i/o latency, $t_{io}$, than purely electrical i/o pads (e.g. 40 ns versus 3 ns)
at comparable power levels [LHM+89]. Since optical detection latency is inversely proportional 
to the incident power level, the optical conversion would require an order of magnitude greater 
power than the electrical pads to make the optical i/o latency comparable to electrical i/o latency. It 
only makes sense to make this optical conversion when the distance traversed is sufficiently large 
that the reduction in physical transit latency due to faster propagation is larger than the conversion 
latencies. 
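
The break-even distance implied by this argument can be estimated with a short Python sketch. It assumes the electrical signal travels at c divided by the square root of the relative dielectric constant, that the optical signal travels at c, and that the i/o latency figures cover the whole conversion overhead of a link. The dielectric constant of 4 used below is an assumed, illustrative value, not one taken from this report.

```python
import math

C_M_PER_S = 299_792_458.0  # speed of light in vacuum


def break_even_distance_m(t_io_optical_s: float,
                          t_io_electrical_s: float,
                          relative_permittivity: float) -> float:
    """Distance beyond which the optical link has lower total latency.

    Solves t_io_e + d / v_e = t_io_o + d / c for d, with v_e = c / sqrt(eps_r).
    """
    if relative_permittivity <= 1.0:
        raise ValueError("electrical medium must be slower than light in vacuum")
    v_electrical = C_M_PER_S / math.sqrt(relative_permittivity)
    extra_conversion_s = t_io_optical_s - t_io_electrical_s
    saved_per_meter_s = 1.0 / v_electrical - 1.0 / C_M_PER_S
    return extra_conversion_s / saved_per_meter_s


# 40 ns optical versus 3 ns electrical i/o latency, assumed eps_r = 4
distance_m = break_even_distance_m(40e-9, 3e-9, 4.0)   # roughly 11 meters
```

Under these assumptions the optical conversion only pays off for runs longer than about 11 meters, which is consistent with reserving it for the longest links in a machine.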

Current VLSI technology limits the bonding of i/o pads to the periphery of the integrated circuit 
die. This forces the number of i/o channels into an integrated circuit (IC) to be proportional to the 
perimeter of the die. Due to external bonding requirements, i/o pads are shrinking more slowly than 
other IC features. Consequently, ICs have a fairly fixed, limited number of i/o pads and this number 
is not scaling at a rate comparable to the scaling of useful silicon area inside the die. Available
technology, thus, limits the number of i/o channels into an IC and hence the size of the primitive 
switching elements we can build. 

We must always take account of the fact that wires and components consume space. The finite 
thickness of wires limits the physical compactness of our multiprocessor. The space between nodes 
and routers must be large enough to accommodate the wires necessary to provide interconnect. In 
some topologies, the growth rate of the machine is dictated by the growth rate for the interconnect 
as much as the number and size of components. Additionally, space must be provided for adequate 
component cooling and access for repair. 

2.7.2 Design Complexity 

Each different component in a system requires separate: 

• Engineering effort to design and verify 

• Non-recurring engineering (NRE) costs to produce 

• Testing to select good components and diagnose potentially faulty components 

• Shelf-space to stock the components 

Consequently, it is beneficial to minimize the number of different components used in constructing 
any system. 

2.8 Flexibility Concerns 

Just as engineering more types of components is costly in terms of development, NRE, and 
testing, designing a new network for each new application or specific machine is also costly. 
We look for solutions which provide a wide range of flexibility so they can easily be extended
or re-parameterized to solve a variety of problems. When building a network for a large-scale
multiprocessor, our desire for flexibility leads us to be concerned about the following: 

• How do we provide additional bandwidth for each node at a given level of semiconductor 
and packaging technology? 

• How do we get more/less fault tolerance for applications which place a higher/lower premium
on faults?

• How do we build larger (smaller) machines?

• How can we decrease latency, and at what cost?


Part II 

Engineering Reliable, Low-Latency 

Networks 


3. Network Organization 


In this chapter we survey potential low-latency networks and identify a family of networks which 
is most suitable for use in large-scale, fault-tolerant multiprocessors given practical considerations. 
After determining the basic network structure, we examine the issues involved in optimizing a 
particular network for a given application. 

3.1 Low-Latency Networks 

3.1.1 Fully Connected Network 

From the standpoint of latency, the optimal network is a fully-connected network in which 
every processor has a direct connection to every other processor (See Figure 3.1). Here, there is no 
switching latency (i.e. $T_s = 0$). The problem with this network, of course, is that the processor
node size grows linearly with the size of the system. This is not practical for several reasons. We
cannot build very large networks with bounded pin-out components, and a different component size
is needed for each different network size. Using techniques from [Tho80] and [LR86], we find the
interwiring resources will grow as $\Theta(N^3)$. Wiring constraints alone require that the best packaging
volume grows as $\Theta(N^3)$, making, in the best case, the wiring distances, $d$, grow as $\Theta(N)$. Such an
organization is not very practical.

3.1.2 Full Crossbar 

Next, we consider a full crossbar arrangement (See Figure 3.2). If we could build a large enough
crossbar, we would only traverse one switching node between any source-destination pair. Unfortunately,
our pin limitations (Section 2.7.1) will not allow us to build a single crossbar of arbitrary size. In
practice, we would have to distribute the function across many different components as shown in
Figure 3.3. This would incur $O(N)$ switching latency and require $O(N^2)$ such switches.

Figure 3.1: Fully Connected Networks

Figure 3.2: Full 16 X 16 Crossbar

Figure 3.3: Distributed 16 X 16 Crossbar

3.1.3 Hypercube 

Shown above is a 16 processor hypercube. (Drawing by Frederic Chong)

Figure 3.4: Hypercube

Figure 3.5: Mesh - k-ary-n-cube with k = 2

We might consider building a hypercube network to exploit locality and distributed routing
control. The switching latency is $\log_2(N)$ as we need traverse at most one switching link in
each dimension of the hypercube. Unfortunately, to maintain this characteristic, the switching
node degree grows as $\Theta(\log(N))$. Node size soon runs into our pin limitations (Section 2.7.1)
and a different size node is needed for each size of the machine constructed. Additionally, when
implemented in three-dimensional space, the interconnection requirements cause the machine
volume to grow as $\Theta(N^{3/2})$. This result is also derivable from the techniques presented in [Tho80]
and [LR86] by considering the number of wires which must cross through the middle of the machine
in any decomposition. If we divide an $N$-processor machine in half, the number of wires crossing
the bisecting plane will be $\Theta(N)$. If we distribute these wires in the two-dimensional plane dividing
the two halves, then the plane is $\Theta(\sqrt{N})$ wire widths wide in each dimension. Considering that we
get the same effect if we divide the machine via an orthogonal plane which also bisects the machine,
we see that the machine is $\Theta(\sqrt{N})$ long in each dimension and hence the volume is $\Theta(N^{3/2})$. From
this we can see that the transit distance, $d$, will generally grow as $\Theta(\sqrt{N})$.
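
The bound of one switching link per dimension can be seen in a small Python sketch. It assumes nodes are numbered so that neighbors differ in exactly one address bit, and it corrects the differing bits in ascending dimension order; that ordering is one common choice picked for illustration, not a routing scheme prescribed by this report.

```python
def hypercube_route(src: int, dst: int, dimensions: int) -> list[int]:
    """Nodes visited from src to dst, crossing at most one link per dimension."""
    size = 1 << dimensions
    if not (0 <= src < size and 0 <= dst < size):
        raise ValueError("node address out of range")
    path = [src]
    node = src
    for dim in range(dimensions):
        bit = 1 << dim
        if (node ^ dst) & bit:      # addresses differ in this dimension
            node ^= bit             # traverse the link along this dimension
            path.append(node)
    return path


route = hypercube_route(0b0000, 0b1011, 4)   # [0, 1, 3, 11]
```

The number of links traversed equals the number of address bits in which source and destination differ, so it never exceeds the number of dimensions.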

Making some compromises for practicality on the basic hypercube structure, a number of 
derivative networks result. The next two sections cover two major classes, multistage networks 
and k-ary-n-cubes. 

3.1.4 k-ary-n-cube

For k-ary-n-cubes, we fix the dimension ($k$) to avoid the switching node size growth problem
associated with the pure hypercube. We still get the locality and distributed routing. The switching
latency grows as $O(\sqrt[k]{N})$ since there are at most $\sqrt[k]{N}$ routers which must be traversed in each
dimension. Many popular k-ary-n-cube networks in use today set $k = 2$ or $k = 3$ to build mesh
(See Figure 3.5) or cube (See Figure 3.6) structures [Dal87]. For these networks, the distances
between components can be made uniformly short such that the switching latency dominates the
transit latency. When constrained to three-dimensional space, larger values of $k$ will tend to have
transit latencies which scale as $\Omega(\sqrt[3]{N})$. Toroidal k-ary-n-cubes can be used to cut the worst case
switching latency in each dimension in half and avoid hot-spot problems in simple k-ary-n-cubes
(See Figure 3.7) [DS86].

Shown above is a 27 processor cube network. (Drawing by Frederic Chong)

Figure 3.6: Cube - k-ary-n-cube with k = 3

Figure 3.7: Torus - k-ary-n-cube with k = 2 and Wrap-Around Torus Connections


Figure 3.8: 16 X 16 Omega Network Constructed from 2x2 Crossbars 

3.1.5 Flat Multistage Networks 

A multistage network distributes each hypercube routing element spatially so that fixed-degree
switches can be used for routing. Like the hypercube, routing can occur in a distributed manner
requiring only $\log_r(N)$ stages between any pair of nodes in the network. Here $r$ is a constant
known as the radix which denotes the number of distinct directions to which each routing switch
can route. Unlike the hypercube and k-ary-n-cube, the multistage network does not provide any
locality. The number of switches required by a multistage network grows as $O(N \log(N))$. The
best-case packaging volume grows as $\Theta(N^{3/2})$ and the transit latency grows as $\Theta(\sqrt{N})$ like the
hypercube [LR86].

Quite a variety of networks can be classified as multistage networks including: Butterfly networks,
Banyan networks, Bidelta networks [KS86], Benes networks, and Multibutterfly networks.
Figures 3.8 through 3.11 show some popular multistage networks. Each stage in these networks 
routes by successively subdividing the set of possible destinations into a number of equivalence 
classes equal to the radix of the routing components. For example, consider a radix-2 network. 
When connections enter the network, any input can reach any destination. The first stage of routing 
components divides this class into two different equivalence classes based on desired destination. 
Each succeeding network stage further subdivides a previous stage’s equivalence classes into two 
more equivalence classes. When there is a single destination in each equivalence class, the network 
has uniquely determined the desired destination and can connect to the destination endpoints. This 
successive subdivision can be easily seen in the network shown in Figure 3.9. 
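
The successive subdivision can be sketched in Python as follows. The sketch assumes destinations are numbered 0 through r^S - 1 and that each stage consumes one radix-r digit of the destination address, most significant digit first. Real networks may permute the digit order between stages, so the numbering here is illustrative only.

```python
def routing_digits(destination: int, radix: int, stages: int) -> list[int]:
    """Output direction selected at each stage, first stage first."""
    if not 0 <= destination < radix ** stages:
        raise ValueError("destination out of range")
    digits = []
    for stage in range(stages):
        weight = radix ** (stages - 1 - stage)
        digits.append((destination // weight) % radix)
    return digits


def remaining_destinations(destination: int, radix: int, stages: int,
                           after_stage: int) -> range:
    """Equivalence class of destinations still reachable after `after_stage` stages."""
    size = radix ** (stages - after_stage)
    start = (destination // size) * size
    return range(start, start + size)


digits = routing_digits(13, radix=2, stages=4)            # [1, 1, 0, 1]
half = remaining_destinations(13, 2, 4, after_stage=1)    # destinations 8..15
```

Each stage shrinks the equivalence class by a factor of the radix, and after the last stage the class contains only the requested destination.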

3.1.6 Tree Based Networks 

Properly constructed, a tree-based, multistage network avoids the major liabilities associated
with the standard multistage networks. Specifically, we consider fat-tree networks as described
in [Lei85] and [GL85] and shown in Figure 1.2. The switching delay remains $\Theta(\log(N))$ as
with hypercubes and multistage networks. Routing may occur in a distributed fashion. Unlike the
multistage networks described above, the tree-based networks do allow locality exploitation. When
the bandwidth between successive stages of the tree is chosen appropriately, the tree structures can
be arranged efficiently in three-dimensional space; switching and wiring resources grow as $\Theta(N)$
and transit latency will grow as $\Theta(\sqrt[3]{N})$. While a tree-based network may have less cross-machine
bandwidth than a hypercube with the same number of nodes, the tree-based machine requires
$O(\log(N))$ less interconnect hardware. As a result, if one were to compare machines of the same
size, taking into account three-dimensional space restrictions, the tree machine provides at least as
much bandwidth while supporting $O(\log(N))$ more nodes. Leiserson shows that properly sized fat
trees can efficiently perform any communication performed by any other similarly sized network
[Lei85].

Figure 3.9: 16 X 16 Bidelta Network

Figure 3.10: Benes Network

Figure 3.11: 16 X 16 Multibutterfly Network

3.1.7 Express Cubes 

Express cubes [Dal91] are a hybrid between a tree-structure and a k-ary-n-cube (See Figure 3.12).
By placing interchange switches periodically in a k-ary-n-cube, the switching delay can
be reduced from $O(\sqrt[k]{N})$ to $O(\log(N))$. Done properly, the transit latency remains $\Theta(\sqrt[3]{N})$. If
we allow several different kinds of switching elements in the network, the size of each switching
element can be limited to a fixed size.

3.1.8 Summary 

Table 3.1 summarizes the major characteristics of the networks reviewed here. Asymptotically, 
at least, we see that fat trees and express cubes have the slowest growing transit and switching 
latencies while maintaining the slowest resource growth. For a limited range of network sizes, 
flat multistage networks and k-ary-n-cubes may offer reasonable, or even superior, performance at
reasonable hardware costs. 

3.2 Wire Length 

In this chapter, we have introduced many networks which have wires whose length is a function 
of the network size. We call a long wire any single run of wire between two switches which has a 
transit time in excess of the rate at which we could otherwise clock data between the switches. If 
we required the data to traverse any such wires in a single clock cycle, we would have to increase 
the clock period to accommodate the longest wire in the system. The longest wires in many of 
these networks will be $\Omega(\sqrt[3]{N})$ due to spatial constraints in three dimensions. Requiring data to


Shown above is a portion of an express mesh after [Dal91]. The components labelled with 
an I are interchange units which allow connections to be routed along express channels, 
thereby bypassing intermediate switching nodes. 


Figure 3.12: Express Cube Network - k = 2 


Network               $T_s$                    $T_t$                     Locality  Resources           Practical Drawbacks
Fully Connected       0                        $\Theta(N)$               -         $\Theta(N^3)$       Node size ~ $\Theta(N)$
Distributed Crossbar  $\Theta(N)$              $\Theta(N)$               some      $\Theta(N^2)$
Hypercube             $\Theta(\log(N))$        $\Theta(\sqrt{N})$        yes       $\Theta(N^{3/2})$   Node size ~ $\Theta(\log(N))$
k-ary-n-cube          $\Theta(\sqrt[k]{N})$    $\Theta(\sqrt[3]{N})$     yes       $\Theta(N)$
Flat Multistage       $\Theta(\log(N))$        $\Theta(\sqrt{N})$        no
Fat-Tree              $\Theta(\log(N))$        $\Theta(\sqrt[3]{N})$     yes       $\Theta(N)$
Express Cube          $\Theta(\log(N))$        ~$\Theta(\sqrt[3]{N})$    yes       $\Theta(N)$



Table 3.1: Network Comparison 


traverse these wires in a single clock cycle would require our clock period to increase comparably
with network size. However, if we pipeline multiple bits on the long wires, we do not have to adjust
the clock frequency to accommodate long wires. Our notion of transit latency as proportional to
interconnection distance (Equation 2.2) will still hold. Instead of being a continuous equation as
given, it becomes discretized in units of the clock period, $t_c$.


$T_t = \sum_i \left\lceil \frac{d_i}{v \cdot t_c} \right\rceil t_c$ (3.1)


Equation 3.1 explicitly breaks the total distance into segments ($d_i$) between each pair of switching
elements in the path between the source and destination nodes to properly account for the effects of 
this discretization. Techniques for ensuring correct operation when bits are pipelined on the wires 
are detailed in Section 6.9. 
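
The effect of this discretization can be sketched in Python. The sketch assumes that Equation 3.1 rounds the transit time of each wire segment up to a whole number of clock periods, with propagation velocity v and clock period t_c. The segment lengths and the velocity used below are made-up values for illustration.

```python
import math


def transit_cycles(segment_lengths_m, velocity_m_per_s: float,
                   clock_period_s: float) -> int:
    """Total clock cycles spent on wires, rounding each segment up separately."""
    distance_per_cycle = velocity_m_per_s * clock_period_s
    return sum(math.ceil(d / distance_per_cycle) for d in segment_lengths_m)


def discretized_transit_latency_s(segment_lengths_m, velocity_m_per_s: float,
                                  clock_period_s: float) -> float:
    """Transit latency in seconds, in whole units of the clock period."""
    cycles = transit_cycles(segment_lengths_m, velocity_m_per_s, clock_period_s)
    return cycles * clock_period_s


segments = [0.1, 0.5, 1.0]                                    # meters between switches
cycles = transit_cycles(segments, 1.5e8, 2e-9)                # 1 + 2 + 4 = 7 cycles
latency = discretized_transit_latency_s(segments, 1.5e8, 2e-9)
```

Because each segment is rounded up on its own, the discretized latency (14 ns here) can exceed the continuous value of total distance divided by velocity (about 10.7 ns), which is why the sum is taken per segment rather than over the whole path.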


3.3 Fault Tolerance 


In order to achieve fault tolerance in the network, we need multiple, distinct paths between any 
pair of nodes. The more distinct paths our network supports, the more robust the network will be 
to faults occurring in the network. In this section, we look at the multipath nature of the practical 
low-latency networks identified in the previous section. 

3.3.1 Indirect Routing 

If we allow indirect routing, all of the networks examined in this chapter have multiple paths. 
With indirect routing, a message may be routed from a source to a destination node by first routing 
the message through one or more intermediate nodes in the network. That is, when the source 
cannot reach the destination directly through the network, it is often possible for it to reach another 
processing node in the network which can, in turn, reach the desired destination node. If we allow 
arbitrary indirect hops through the network, any message can eventually be routed as long as the 
transitive closure of the non-faulty direct interconnect covers all the nodes in use in the network. 
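
The transitive-closure condition amounts to a reachability search over the surviving direct connections. The Python sketch below is a minimal illustration; the dictionary-of-lists representation and the small example network are assumptions made for this example.

```python
from collections import deque


def hops_from(direct_links, source):
    """Fewest network traversals needed to reach each node from `source`.

    direct_links maps a node to the nodes it can still reach directly
    through the non-faulty portion of the network. Unreachable nodes
    are absent from the result.
    """
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in direct_links.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# Faults have left node 0 able to reach only node 1 directly, and so on.
links = {0: [1], 1: [2], 2: [0], 3: [0]}
from_zero = hops_from(links, 0)      # node 2 needs two traversals, node 3 is cut off
```

A message can eventually be delivered exactly when its destination appears in the result, and the hop count indicates how many times it must cross the network.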

While indirect routing will allow messages to eventually reach their destination, it does so at
the cost of increased latency. Latency increases due to several effects. First, messages must cross
the network multiple times. Second, additional overhead is generally required to allow indirection and
process messages requiring re-routing. Finally, contention latency is increased since each indirected
message consumes network bandwidth on each hop through the network.

3.3.2 k-ary-n-cubes and Express Cubes

Direct, cube-based networks, like the k-ary-n-cube or the express cube, function by indirect
routing. Each node is connected to $O(k)$ neighbors in a regular pattern and all routing is achieved
by sending the message to a neighbor node which, generally, moves the message closer to the 
desired destination. At every hop, the message has a choice of paths to take to the destination, 
many of which would require the same transit and switching latency. The underlying network thus 
provides the requisite multiple paths. It is then up to the routing algorithm to efficiently utilize 
them. If our routing algorithm is omniscient about faults in the network, it can always find the 
shortest path between points in a faulty network. For many faults, the length of the shortest paths 
between close nodes will increase. However, nodes which are further apart will see no increase in
transit or switching latency. The more distant two nodes are from each other, the more minimum 
length paths there will be between them. 

3.3.3 Multiple Networks 

A simple technique for adding fault tolerance to a network which works for all kinds
of networks is to simply replicate a base network. We give each node a connection to each of the 
networks. As long as there is a non-faulty path on some network between any pair of nodes which 
must communicate, normal communication may occur with no degradation in switching or transit 
latency. The originating node need only choose which network to use for each message it needs to 
deliver. Additionally, the existence of multiple networks increases the bandwidth available in the 
network and hence can reduce contention latency if utilized efficiently. Unfortunately, the gain in
fault-tolerance is small compared to the costs. Each additional path through the network requires
that we construct a complete copy of the original network. Multiple, multistage style networks are
used in the telecommunications field to minimize contention and increase available bandwidth over
single-path networks [Hui90]. Figure 3.13 shows a 2-replicated bidelta network.

Two four-stage networks connecting 16 endpoints are attached together at the endpoints.
Each component is a 2 x 2, dilation-1 crossbar.

Figure 3.13: Replicated Multistage Network

Replicated networks do have one advantage over pure indirect routing schemes including most 
cube style networks. With multiple networks, each node does have multiple connections both to and 
from the network. As noted in Section 2.1.2 multiple network i/o connections are key to avoiding 
a single point of failure which may sever a node completely from the interconnection network. 

3.3.4 Extra-Stage, Multistage Networks 

When using multistage interconnection networks one can construct extra-stage networks with 
more switching stages than are actually required to uniquely specify a destination ([LP83], [CYH84],
et al.) (See Figures 3.10 and 3.14). The set of routing specifications that reach the same physical
destination defines a class of equivalent paths. So long as one path of each such class remains intact 
in a faulty extra-stage network, any endpoint will be able to successfully route to its destination. The 
extra stages in these schemes result in larger switching and transit latencies than the corresponding 
baseline network, even in the absence of faults. 

If extra stages are added, but the single connection into and out of each node is retained, extra-stage
networks retain a single point of failure where the nodes connect to the network. To eliminate
this problem, the extra-stage network should be constructed such that multiple network endpoints
can be assigned to each node of the network.

Figure 3.14: Extra Stage Network

3.3.5 Interwired, Multipath, Multistage Networks 

Multibutterfly style, multipath networks are multistage networks which use dilated crossbar 
routing components. In addition to being characterized by the radix of the switching element, each 
dilated crossbar router is characterized by its dilation, d. The dilation is the number of logically 
equivalent outputs in each distinct direction. With a dilation greater than one, redundant routing is 
provided in each routing direction. Figure 3.11 shows an example of such a network. Figure 3.15 
shows some configurations for the dilated routing elements used in Figure 3.11.

This class of multipath networks has a large number of distinct paths between each pair of 
nodes. The number of different switches in a stage which can be used to route between any pair of 
routers increases toward the center of the network. Up to the center of the network, the number of 
routers in any path grows by a factor of the dilation with each successive stage. Past the center of 
the network, the sorting function performed by the network limits the number of routers in the path 
to the desired destination. For those later stages, all routers which are in the path to the destination 
are candidates for use in routing any connection. 

For a given number of node connections, the multibutterfly style networks generally have more 
paths than the comparable replicated network. Consider a k-replicated network. A multipath 
network can be constructed from the k-replicated network by taking each of the k routers in the 
same location in each of the k-replicated network and creating one dilated router out of them with 
dilation, d = k. This will give us a multibutterfly style network. Note that in the replicated network, 
we were only able to choose which resources to use when the message entered the network. In the
multibutterfly network we have the option of switching between networks at each routing stage. 
Thus, there are many more paths through the multibutterfly networks. The fine details of how one 
wires these redundant paths are discussed in Section 3.5. 


Dilated Crossbar (no connections) 


Logically Equivalent Connection Pairs 


Figure 3.15: 4 x 2 Crossbar with a dilation of 2 


By constructing fat-tree networks using dilated crossbar routers, it is possible to build multipath, 
fat-tree networks which exhibit the same basic properties. The tree networks will need multiple 
connections into and out-of the network to avoid single points of failure. Connections made through 
higher tree-levels have more paths between the source and the destination as they traverse more 
dilated routers in the network. 

3.4 Robust Networks for Low-Latency Communications 

Given our need for fault tolerance and low latency, the classes of networks which are most 
attractive are express cubes and multipath, fat-tree networks. For smaller networks, k-ary-n-cubes
and flat multipath, multistage networks are also worth considering. Because of the acyclic nature 
of multistage routing networks, it is easier to devise robust and efficient routing schemes for this 
class of networks. Consequently, we will focus on multistage networks for the remainder of this 
document. 

3.5 Network Design 

This section discusses many of the issues relevant to designing a high-performance, robust, 
multipath, multistage routing network. The space of possible multipath networks is quite large, and 
some of the decisions made when selecting a particular network can make a significant difference 
in the fault tolerance and performance of the network. In addition to the basic parameter selection,
the detailed network wiring scheme can have a notable effect on the performance of the resulting
network. Many of the wiring issues are easier to describe and understand using small, flat, multipath,
multistage interconnection networks. As a result, the examples and development which follow are
given in terms of this class of networks. Nonetheless, the same design principles apply when
developing multipath, fat-tree networks.

N    total number of nodes on the network
ni   input ports from each node to the network
no   output ports from each node to the network
i    input ports per router
o    output ports per router
r    router radix
d    router dilation
w    channel width

Table 3.2: Network Construction Parameters

3.5.1 Parameters in Network Construction 

Table 3.2 summarizes several parameters which will be used in this section when characterizing 
a network. Radix and dilation were introduced in Sections 3.1.5 and 3.3.5. ni and no quantify the
number of connections between each node and the network; i and o are the number of connections
in and out of each router. Generally, $i = o = r \cdot d$. Since the number of inputs and the number of
outputs on the routing components are the same, we say the routers are square. When we use square 
routers, the aggregate bandwidth between stages in flat, multistage networks remains constant. 

3.5.2 Endpoints 

The network endpoints are the weakest link in the network. If we are designing a network with 
a yield model in mind, in the worst case, we can sustain only $\min(ni, no)$ faults. If we are designing
a network with a harvest model in mind, in the worst case each $\min(ni, no)$ faults will remove an
additional node from the operational set.

Once ni and no are chosen, we must also ensure that these connections are utilized effectively. 
Particularly, to maximize robustness, each link must connect to a distinct routing component in the
network. Note, for instance, in the network shown in Figure 3.11, that dilation-1 routers are used in 
the final stage of the network. These dilation-1 routers are used to achieve maximal fault tolerance 
by ensuring that the maximum number, no = 2, of distinct routers provide output connections from 
the network to each node. Figure 3.16 shows another alternative for using dilation-1 routers in the 
final stage. Rather than using $d$ times as many routers with dilation-1 and the base radix unchanged,
the network in Figure 3.16 uses routers which increase the radix by a factor equal to the dilation
(i.e. $r_{final\text{-}stage} = o = r \cdot d$).


Figure 3.16: 16 X 16 Multibutterfly Network with Radix-4 Routers in Final Stage 

3.5.3 Internal Wiring 

Inside a multipath network, we have considerable freedom as to how we wire the multiple 
paths between stages. As described in Section 3.1.5, multistage networks operate by successively 
subdividing the set of potential destinations at each stage. All inputs to routing components in 
the same equivalence class at some intermediate network stage are logically equivalent since the 
same set of destinations can be reached by routing through those components. If we exercise this 
freedom judiciously, we can maximize the fault-tolerance and minimize the congestion within the 
network, and hence minimize the effects of congestion latency. 

Path Expansion 

A simple heuristic for achieving a high degree of fault tolerance is to wire the network to 
maximize the path expansion within the network. That is, we want to select a wiring which allows 
the connection between any two endpoints to traverse the maximum number of distinct routing 
components in each stage. Maximizing path expansion improves fault-tolerance by maximizing 
the redundancy available at each stage of the network. 

Let S be the total number of routing stages in the network. The number of paths between a 
single source-destination pair expands from the source into the network at the rate of dilation, d. 
Thus, we have $p_{in}(s)$, the number of paths to stage $s$, given by Equation 3.2.

$p_{in}(s) = ni \times d^{s-1}$ (3.2)

After some stage in the network, the paths will have to diminish in order to connect to the proper
destination. Looking backward from the destination node, we see that the paths must grow as the
network radix $r$. This constraint is expressed as follows:

$p_{out}(s) = no \times r^{(S+1)-s}$ (3.3)


s      1   2   3   4   5
p(s)   2   4   8   4   2


Table 3.3: Connections into Each Stage 


These two expansions must, of course, meet at some point inside the network. This occurs when
$p_{in}$ and $p_{out}$ are equal. Let us call this turning point stage $s'$. $s'$ can be determined as follows:

$p_{out}(s') = p_{in}(s')$

$ni \times d^{s'-1} = no \times r^{(S+1)-s'}$

$s' = \frac{(S+1) \cdot \ln(r) + \ln(no) + \ln(d) - \ln(ni)}{\ln(d) + \ln(r)}$

$s' = \frac{(S+1) \cdot \ln(r) + \ln\left(\frac{d \cdot no}{ni}\right)}{\ln(d \cdot r)}$ (3.4)

Once Equation 3.4 is solved for $s'$, we can quantify the number of connections into each stage of
the network by Equation 3.5.

$p(s) = \begin{cases} ni \times d^{s-1} & s < s' \\ \min\left(ni \cdot d^{s-1},\; no \cdot r^{(S+1)-s}\right) & s = s' \\ no \times r^{(S+1)-s} & s > s' \end{cases}$ (3.5)

Note that Equation 3.5 expresses the maximum achievable number of paths between stages for a
single source-destination pair. This is effectively an upper bound on the path expansion in any
dilated multipath network. The total number of distinct paths between each source and destination
simply grows as Equation 3.2 and is thus given by Equation 3.6.

$p_{total}(S) = ni \times d^{S-1}$ (3.6)

For example, consider the network in Figure 3.11 ($ni = no = r = d = 2$, $S = 4$). Solving
Equation 3.4 for $s'$, we find $s' = 3$. The number of connections into each stage can then be
calculated as shown in Table 3.3. The total number of paths is simply $2 \times 2^3 = 16$. Noting
Figure 3.11, we see it does achieve this maximum path expansion for the highlighted path; the 
paths between all other source and destination pairs in Figure 3.11 also achieve this path expansion. 
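
The worked example can be reproduced with a short Python sketch. It reads Equation 3.2 as ni times d^(s-1) and Equation 3.3 as no times r^((S+1)-s), a reading which matches the values in Table 3.3, and it takes the total path count as ni times d^(S-1), which matches the figure of 16 quoted above. The function names are ours.

```python
import math


def turning_stage(ni: int, no: int, r: int, d: int, S: int) -> float:
    """Stage s' at which the inbound and outbound expansions meet (Equation 3.4)."""
    return ((S + 1) * math.log(r) + math.log(d * no / ni)) / math.log(d * r)


def paths_into_stage(s: int, ni: int, no: int, r: int, d: int, S: int) -> int:
    """Maximum connections into stage s for one source-destination pair (Equation 3.5).

    The inbound count grows with s and the outbound count shrinks with s,
    so the piecewise form reduces to taking the smaller of the two.
    """
    p_in = ni * d ** (s - 1)
    p_out = no * r ** ((S + 1) - s)
    return min(p_in, p_out)


def total_paths(ni: int, d: int, S: int) -> int:
    """Distinct paths between one source and one destination."""
    return ni * d ** (S - 1)


params = dict(ni=2, no=2, r=2, d=2, S=4)
s_prime = turning_stage(**params)                                  # 3.0
per_stage = [paths_into_stage(s, **params) for s in range(1, 6)]   # [2, 4, 8, 4, 2]
paths = total_paths(ni=2, d=2, S=4)                                # 16
```

Varying the dilation or the number of endpoint connections in `params` gives a quick feel for how each parameter moves the turning point and the peak redundancy at the center of the network.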

$\alpha$-$\beta$ Expansion

Unfortunately, path expansion can be a naive metric when optimizing the aggregate fault- 
tolerance and performance of a network. Path expansion looks at a single source-destination pair 
and tries to maximize the number of paths between them. If we only considered path expansion in
selecting a network design, many nodes could share the same sets of routers and connections in their 
paths through the network. This sharing would lead to a higher-degree of contention. Additionally, 
when faults accumulate in the network, a larger number of nodes are generally isolated from the 







Figure 3.17: Left: Non-expansive Wiring of Processors to First Stage Routing Elements 
Figure 3.18: Right: Expansive Wiring of Processors to First Stage Routing Elements 


rest of the network at once. Consider, for instance, the two first-stage network wirings shown
in Figures 3.17 and 3.18. Both wirings are arranged such that each processor connects to two
distinct routers in the first stage of routing. However, the wiring shown in Figure 3.17 has four
processors which share a pair of routers, whereas any group of four processors in the wiring shown
in Figure 3.18 is connected to five routers in the first stage. As a result, there will generally be less
contention for connections through the first stage of routers in the latter wiring than in the former.










Leighton and Maggs introduced α-β expansion to formalize the desirable expansion properties
as they pertain to groups of nodes which may wish to communicate simultaneously [LM89].
Informally, α-β expansion is a metric of the degree to which any subset of components in one stage
will fan out into the next stage. More formally, we say a stage has α-β expansion (α, β) if any
subset of α components from one stage must connect to at least α × β components in the next stage.
β is thus an expansion factor which is guaranteed for any set of size α. Networks with favorable
α-β expansion are networks for which the α-β expansion property holds with higher β for each
value of α. The more favorable the α-β expansion, the more messages can be simultaneously
routed between any sets of communicating processors, and hence the lower the contention latency.
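
For very small wirings, the guaranteed expansion factor β for a given α can be found by brute force. The sketch below is illustrative only: the two wirings are hypothetical (they are not the wirings of Figures 3.17 and 3.18), the function name is an assumption, and the exhaustive search over subsets is only practical for toy sizes.

```python
from itertools import combinations


def expansion_factor(wiring, alpha):
    """Smallest |next-stage neighbors| / alpha over all subsets of exactly alpha components.

    `wiring` maps each component of one stage to the set of next-stage
    components it connects to.
    """
    worst = min(
        len(set().union(*(wiring[c] for c in subset)))
        for subset in combinations(wiring, alpha)
    )
    return worst / alpha


# Hypothetical wirings: 8 processors, each connected to 2 of 4 first-stage routers.
clustered = {p: {0, 1} if p < 4 else {2, 3} for p in range(8)}
spread = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {0, 3},
          4: {0, 2}, 5: {1, 3}, 6: {0, 2}, 7: {1, 3}}

if __name__ == "__main__":
    print(expansion_factor(clustered, 4))   # 0.5: four processors can share just two routers
    print(expansion_factor(spread, 4))      # 0.75: any four processors reach at least three
```

Both wirings give every processor two distinct routers, yet the second guarantees a larger β at α = 4, which is the distinction the α-β metric is meant to capture.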

Networks Optimized for Yield 

If we cannot tolerate node loss, and hence wish to optimize the fault tolerance of the network
as a yield problem, then it makes sense to focus on achieving the maximal path expansion first,
then achieving as large a degree of α-β expansion as possible. Unfortunately, there is presently no
known algorithm for achieving a maximum amount of α-β expansion, so the techniques presented
here are heuristic in nature.

To achieve maximum path expansion, we connect the network with the algorithm listed in 
Figure 3.19 [CED92]. The paths from any input to any output may fan out by no more than a factor
of d, the dilation of the routers, at each stage. This fanout may also become no larger than the size of 
the routing equivalence classes at that stage. The routine groupsz returns the maximum fanout size 
allowed by both of these factors. Each stage is partitioned into fanout classes of this size, which 
are then used to calculate network wiring. The maximum path fanout described in Equation 3.5 is 
achieved by this algorithm for all pairs of components. 

As introduced above, the last stage is composed of dilation-1 routers to increase fault tolerance. 
Figure 3.20 shows a deterministically-interwired network composed of radix-2 routers. 

Networks Optimized for Harvest 

To achieve a high harvest rate and maximize performance, we want to wire networks with a high
degree of α-β expansion. As introduced above, there are no known deterministic algorithms for
achieving an optimal expansion. In practice, randomized wiring schemes produce higher expansion
than any known deterministic methods. [Kah91] presents some of the most recent work on the
deterministic construction of expansion graphs. [Upf89] and [LM89] show that randomly wired
multibutterflies have good expansion properties. The high expansion generally means there will
be less congestion in the network. Additionally, Leighton and Maggs show that after k faults have
occurred on an N-node machine, it is always possible to harvest N − O(k) nodes [LM89].

As introduced in Section 3.1.5, multistage networks operate by successively subdividing the set 
of potential destinations at each stage. All the inputs to routing components in the same equivalence
class at some intermediate stage in the network are logically equivalent. After the routing structure
determines which set of outputs in one stage must be connected to which set of inputs in the 
following stage, we randomly assign individual input-output pairs within the corresponding sets. 
Figure 3.21 shows the core of an algorithm for randomly wiring a multibutterfly. The algorithm was 
first introduced in [CED92] and is based on the wiring scheme described in [LM89]. In practice, 





▷ Returns the next-stage router to which to wire for maximum path expansion
wire_to_port(n, d_p, s)
    ▷ n = router number, d_p = dilated port number, s = router stage
1   outgrpsz ← groupsz(s)
2   ingrpsz ← groupsz(s + 1)
3   eq_start ← ingrpsz × ⌊n / (r × d × outgrpsz)⌋
        ▷ offset to beginning of fanout class
4   eq_router ← ((n × d + d_p) mod ingrpsz)
        ▷ offset to specific chip within fanout class
5   return(eq_start + eq_router)

▷ Calculates size of fan-out class
groupsz(s)
1   expansion ← ni × d^(s+1)          ▷ maximum fanout due to dilation
2   eq_class ← no × r^((S+1)−s)       ▷ equivalence class size
3   return(min(expansion, eq_class))


This algorithm generates a network designed to maximize path expansion. Each endpoint 
will have the maximum number of redundant paths possible through this type of network 
(boundary cases omitted for clarity). 


Figure 3.19: Pseudo-code for Deterministic Interwiring 



Figure 3.20: 16 × 16 Path Expansion Multibutterfly Network













▷ in_set contains all the input ports of a single equivalence class in the next stage.
▷ connections is an array matching in_ports and out_ports, initially empty.
▷ out_ports_list lists the output ports of a single equivalence class in the
▷ current stage.
wire_eq_class(in_set, connections, out_ports_list)
1   foreach out_port in out_ports_list
2       in_port ← choose and remove a random input port from in_set
3       while (connected(router#(in_port), router#(out_port), connections))
4           put in_port back in in_set
5           in_port ← choose and remove a random input port from in_set
6       connect(in_port, out_port, connections)
7   return(connections)

connected(in_router, out_router, connections_array)
1   if in_router is already connected to out_router
2       return(true)
3   else return(false)


This algorithm randomly interwires an equivalence class. To interwire a whole stage, the 
algorithm is repeated for each class (boundary cases omitted for clarity). 


Figure 3.21: Pseudo-code for Random Interwiring 


one would generate many such networks, compare their performance as described in Sections 3.5.4 
and 3.5.5, and pick the best one. Experience indicates that most such networks perform equivalently. 
The testing, however, assures that one avoids the unlikely, but possible, case in which a network 
with poor expansion was generated. Figure 3.22 shows a network constructed with this algorithm. 
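
The random interwiring of a single equivalence class can be sketched in runnable form as follows. This is a hedged illustration rather than the code used in [CED92]: ports are modeled as hypothetical (router, port) tuples, and the bounded retry count and the reseeding wrapper are additions of this sketch that cover the boundary case in which the only remaining input ports sit on routers that are already connected.

```python
import random


def wire_eq_class(in_ports, out_ports, rng, max_tries=1000):
    """Randomly pair each output port with an input port of the next stage.

    Ports are (router_id, port_index) tuples. A draw is rejected while the two
    routers are already connected, so redundant links reach distinct routers.
    Returns a list of (out_port, in_port) pairs.
    """
    free = list(in_ports)
    linked = set()                      # (out_router, in_router) pairs already wired
    connections = []
    for out_port in out_ports:
        for _ in range(max_tries):
            in_port = free.pop(rng.randrange(len(free)))
            if (out_port[0], in_port[0]) not in linked:
                break
            free.append(in_port)        # put it back and draw again
        else:
            raise RuntimeError("no admissible input port left for this draw order")
        linked.add((out_port[0], in_port[0]))
        connections.append((out_port, in_port))
    return connections


def wire_with_retries(in_ports, out_ports, seed=0, attempts=100):
    """Retry with fresh seeds if a random draw order dead-ends."""
    for k in range(attempts):
        try:
            return wire_eq_class(in_ports, out_ports, random.Random(seed + k))
        except RuntimeError:
            continue
    raise RuntimeError("could not wire this equivalence class")


if __name__ == "__main__":
    outs = [(r, p) for r in range(4) for p in range(2)]        # 4 routers, 2 outputs each
    ins = [(r + 4, p) for r in range(4) for p in range(2)]     # 4 routers, 2 inputs each
    for out_port, in_port in wire_with_retries(ins, outs):
        print(out_port, "->", in_port)
```

Generating several candidate networks then amounts to calling the wrapper with different seeds and keeping the wiring that evaluates best.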

Hybrid Network Compromise 

Chong observed in [CK92] that one can achieve maximum path expansion while introducing 
some randomized expansion to minimize congestion. The result is a network which is a hybrid 
between the two described above. The basic strategy used in wiring such randomized, maximal-fanout
networks is to further subdivide each routing equivalence class into fanout classes. Instead of
randomly wiring from all outputs destined for a given equivalence class to the inputs on all routers 
in that equivalence class in the subsequent stage, the dilated outputs from each router are each 
sent to different fanout classes within the appropriate routing equivalence class (See Figure 3.23). 
Figure 3.24 sketches the algorithm used for wiring up these networks. Figure 3.25 shows an 
example of such a network. 







A randomly-interwired, four-stage network connecting 16 endpoints. Each component in
the first three stages is a 4 × 2, dilation-2 crossbar. To prevent any single component from
being in an endpoint’s critical path, the last stage is composed of 2 × 2, dilation-1 crossbars.


Figure 3.22: Randomly-interwired Network 

3.5.4 Network Yield Evaluation 
Yield 

As a simple metric for evaluating the yield characteristics of these multipath networks, we 
consider the probability that a network remains completely connected given a certain number of 
randomly chosen router faults. These Monte Carlo experiments model only complete router faults 
to show the relative fault-tolerant characteristics of these networks while containing the size of the 
fault-space which must be explored. 

The experiment proceeds by placing one randomly chosen fault at a time until the network 
becomes incomplete. The basic process is repeated on the same network for enough trials to 
achieve statistically significant results. Results are tabulated to approximate the probability of 
network completeness for each fault level. We also derive the expected number of faults each 
network can tolerate. 
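
The structure of this experiment can be sketched as below. The sketch makes loud simplifications: the network is a hypothetical ring of eight routers in which endpoint e attaches to routers e and e + 1, and "complete" is reduced to every endpoint keeping at least one live router, whereas the real experiments test full connectivity through a multistage network. All names are illustrative.

```python
import random
from collections import Counter


def faults_until_incomplete(routers, is_complete, rng):
    """Place random router faults one at a time; return how many were tolerated."""
    order = list(routers)
    rng.shuffle(order)                  # uniformly random fault order
    faulty = set()
    for count, router in enumerate(order, start=1):
        faulty.add(router)
        if not is_complete(faulty):
            return count - 1
    return len(order)


def yield_experiment(routers, is_complete, trials, seed=0):
    """Estimate P(complete | k faults) for each k and the expected faults tolerated."""
    rng = random.Random(seed)
    tolerated = [faults_until_incomplete(routers, is_complete, rng) for _ in range(trials)]
    histogram = Counter(tolerated)
    max_k = max(tolerated)
    # A trial that tolerated t faults was still complete at every fault level k <= t.
    p_complete = [sum(n for t, n in histogram.items() if t >= k) / trials
                  for k in range(max_k + 2)]
    return p_complete, sum(tolerated) / trials


# Toy stand-in for a network: 8 routers in a ring, endpoint e uses routers e and e + 1.
ROUTERS = range(8)


def toy_is_complete(faulty):
    return all(not {e, (e + 1) % 8} <= faulty for e in range(8))


if __name__ == "__main__":
    p_complete, expected = yield_experiment(ROUTERS, toy_is_complete, trials=1000)
    for k, p in enumerate(p_complete):
        print(f"{k} faults: P(complete) ~ {p:.3f}")
    print("expected faults tolerated ~", expected)
```

Replacing the toy completeness predicate with a connectivity test over a real wiring gives the kind of per-fault-level curve and expected-fault figure that the evaluation tabulates.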

Because the routing components in the final stage of our multipath networks are half the size 
of routers in the previous stages, we assign two such routers to one physical component package 
and label both routers faulty if the physical component is chosen to be faulty. Furthermore, the 
two routers are assigned so that removing any such pair will not cut off an endpoint. We make 
this assignment so that fault increments will be of constant hardware size. This assignment also 
simulates how the pair of 4 x 4, dilation-1 routers in an RN1 routing component (See Chapter 8) 
may be assigned. 

We generated three-stage and four-stage networks for each of the types of networks described 





















The above figure shows how to achieve maximal fanout while avoiding regularity. The 
routers shown are radix-2 and dilation-2. At stage s, we divide each routing equivalence 
class into n_i · d^(s−1) fanout classes until each fanout class contains a single router. Random
wirings are chosen between appropriate fanout classes to form fanout trees. The disjoint 
nature of fanout classes ensures that fanout-trees will have physically distinct components. 


Figure 3.23: Randomized Maximal-Fanout (diagram from [CK92]) 


above, each connecting 64 and 256 endpoint nodes respectively. Each endpoint has two connections 
to and from the network (ni = no = 2) to provide for the minimal amount of redundancy necessary 
to achieve fault tolerance. Every network uses radix-4 routers of dilation-2 and dilation-1 and 
hence could be implemented using the RN1 component. All the networks with a given number of 
stages contain the same number of components. Network wiring is solely accountable for the fault 
tolerance and performance differences of these networks. 

For each network, the yield probability of the network is plotted against the number of uniformly
distributed random faults. Results for the three-stage and four-stage networks are shown
in Figure 3.26. The expected number of faults that each network can tolerate is summarized in
Table 3.4.


wire_stage(s)    ▷ s = routing stage
1   prev_expansion ← ni × d^(s+1)              ▷ maximum fanout to stage s due to dilation
2   prev_eq_class ← no × r^((S+1)−s)           ▷ equivalence class size at stage s
3   prev_fanout_class ← prev_eq_class / prev_expansion
                                               ▷ fanout class size at stage s
4   expansion ← ni × d^(s+2)                   ▷ maximum fanout to stage s + 1 due to dilation
5   eq_class ← no × r^((S+1)−(s+1))            ▷ equivalence class size at stage s + 1
6   fanout_class ← eq_class / expansion        ▷ fanout class size at stage s + 1
7   if (fanout_class > 1)
8       foreach fanout equivalence class in stage s
9           create (r × d) different output-port lists,
            one for each output from a routing switch
            ▷ each of these lists will contain prev_fanout_class ports
10          foreach output-port list identified, identify the fanout_class
            routers in stage s + 1 to which these ports should be
            connected - the inputs on these routers make up the
            corresponding in-port list
11          Use wire_eq_class to randomly interconnect each in-port list
            to each corresponding output-port list
12  else
13      foreach equivalence class in stage s
14          create r different output-port lists,
            one for each logically distinct output direction from a router
            ▷ each of these lists will contain (prev_eq_class × d) ports
15          foreach of the output-port lists identified, identify the eq_class
            routers in stage s + 1 to which the output list should be
            connected - the inputs on these routers make up the
            corresponding in-port list
16          Use wire_eq_class to randomly interconnect each in-port list
            to each corresponding output-port list

This algorithm describes how to wire random, maximal-fanout networks using the random
interwiring algorithm, wire_eq_class, shown in Figure 3.21 (boundary cases omitted for
clarity).

Figure 3.24: Pseudo-code for Random, Maximal-Fanout Interwiring


Figure 3.25: 16 × 16 Randomized, Maximal-Fanout Network


(A) 3-Stage Network Completeness    (B) 4-Stage Network Completeness

The probability that a network with a given number of faults is complete for the randomly-
interwired, path expansion, and random maximal fanout, 3-stage and 4-stage networks. (A)
Each 3-stage network uses 48 radix-4 components to interconnect 64 endpoints. (B) Each
4-stage network uses 256 radix-4 components to interconnect 256 endpoints.


Figure 3.26: Completeness of (A) 3-stage and (B) 4-stage Multipath Networks





















Network                        Total     Test      Expected Failure Tolerated    Error Bound
Stages   Wiring                # Comp.   Trials    # Faults     % Network        # Faults

3        Random                48        1000      5.0          10%              0.063
3        Path Expansion        48        1000      8.1          16%              0.079
3        Random Max Fanout     48        1000      5.2          11%              0.060
4        Random                256       5000      11.8         4.6%             0.075
4        Path Expansion        256       5000      22.6         8.8%             0.130
4        Random Max Fanout     256       5000      12.5         4.9%             0.069


The above table shows the expected number of faults each network can tolerate while 
remaining complete. Each network was fault tested as described in section 3.5.4 for the 
indicated number of trials. 


Table 3.4: Fault Tolerance of Multipath Networks 

Wiring Extra-Stage Networks for Fault Tolerance 

It is worth noting that we can achieve the same fault tolerance as indicated in this section
without using dilated routers. Consider replacing each of the dilated routers used in the networks
above with an equivalently sized (i.e. same number of inputs, i, and same number of outputs, o)
dilation-1 router (i.e. r = o, d = 1). The network we end up with is an extra-stage network since
we have increased the radix while leaving the number of stages the same. From a fault tolerance
perspective, this resulting extra-stage network has the same yield probability as the corresponding
dilated network. As a result, the network wiring issues introduced in Section 3.5.3 apply equally
well to extra-stage, multistage networks as they do to dilated, multistage networks.

Performance Degradation in the Presence of Faults

We are also interested in knowing how robust the network performance is when faults accumulate.
To that end, we consider a simple synthetic benchmark on the complete networks at various
fault levels. This gives us some idea of the effects of congestion in the network, as well as how the 
faults affect the overall performance of the network. The routing protocol detailed in Chapter 4 is 
used for all of these simulations. 

Our synthetic benchmark, FLAT24, was designed to be representative of a shared-memory 
application. FLAT24 uses 24-byte messages with a uniform traffic distribution. FLAT24 generates 
0.04 new messages per router cycle based on the assumption that the network is running at twice 
the clock rate of the processor and a data-cache miss rate of 15%. The application is assumed to 
barrier synchronize every 10,000 cycles, or every 400 messages. Modeling barrier synchronization 
exposes the effects of localized degradation. If a small number of nodes have significantly fewer 
paths through the network than the rest of the nodes, the nodes with less connectivity will fall 
behind those with more. In a real application, these nodes will tend to hold up the remainder of the 
application since they are not progressing as rapidly as the rest of the nodes in the network. The 
periodic barrier synchronization is a simple and pessimistic way of limiting the extent to which 







(A) 3-Stage Network I/O    (B) 3-Stage Network Latencies

(C) 4-Stage Network I/O    (D) 4-Stage Network Latencies


Comparative I/O bandwidth utilization and latencies for 3-stage and 4-stage random and 
path expansion networks on FLAT24. Recall from Table 3.4 that expected percentages of 
failure tolerated by random and deterministic networks are, respectively: 10% and 16% for 
3-stages; and 4.6% and 8.8% for 4-stages. Note that the performance degradation appears to 
level off because only complete networks are measured. Although the surviving networks 
suffer less degradation as percentage of failure increases, the number of surviving networks 
is becoming substantially smaller. 


Figure 3.27: Comparative Performance of 3-Stage and 4-Stage Networks 


nodes may get ahead of each other and hence exposing the effects of this localized degradation. 
This synthetic application and the simulations in general are described in detail in [Cho92]; most 
relevant details are reprinted in Appendix A. 
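
The FLAT24 parameters quoted above are mutually consistent, as a short calculation shows. This sketch assumes that the 10,000-cycle barrier interval is counted in router cycles, which is the reading under which the quoted 400 messages per barrier follows; the constant names are illustrative.

```python
from fractions import Fraction

MESSAGES_PER_ROUTER_CYCLE = Fraction(4, 100)    # 0.04 new messages per router cycle
ROUTER_CYCLES_PER_BARRIER = 10_000              # assumed to be router cycles
ROUTER_CYCLES_PER_PROCESSOR_CYCLE = 2           # network runs at twice the processor clock
MESSAGE_BYTES = 24

messages_per_barrier = MESSAGES_PER_ROUTER_CYCLE * ROUTER_CYCLES_PER_BARRIER
processor_cycles_per_barrier = ROUTER_CYCLES_PER_BARRIER // ROUTER_CYCLES_PER_PROCESSOR_CYCLE
bytes_per_barrier = messages_per_barrier * MESSAGE_BYTES

if __name__ == "__main__":
    print(messages_per_barrier)            # 400
    print(processor_cycles_per_barrier)    # 5000
    print(bytes_per_barrier)               # 9600
```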

Figure 3.27 shows the performance degradation of FLAT24 on the surviving networks at various
fault levels. Here latency is the average time from when a message is injected into the network until
the time its reply and acknowledgment are received. I/O bandwidth utilization measures the average 
fraction of network outputs which are receiving or replying to successful message transmissions at 
any point in time. This provides a measure of the useful bandwidth provided by the network. 





1   Begin with all nodes live.
2   Determine the I/O-isolated nodes and remove them from the set of live nodes.
3   Each faulty chip leading to at least one live node is declared to be blocked.
    Propagate blockages from the outputs to the inputs according to the
    definition of blocking given below.
4   If all of a node’s connections into the first stage of the network
    lead to blocked chips, remove the node from the set of live nodes.


This algorithm harvests the nodes in a network which retain good connectivity in the
presence of faults. The algorithm will sacrifice nodes which still retain weak connectivity
in order to maximize the performance of the harvested network.

▷ A router is said to be blocked if it does not have at least one unused, operational output
port in each logical direction which leads to a router which is not blocked.

▷ An I/O-isolated node is a node which has lost all of its input connections to the first stage
of the network or all of its output connections from the final stage of the network.


Figure 3.28: Chong’s Fault-Propagation Algorithm for Reconfiguration 

3.5.5 Network Harvest Evaluation 

To evaluate the harvest rate of a network with faults, we use the reconfiguration algorithm
suggested by Chong in [CK92]. This reconfiguration algorithm identifies all nodes with “good”
network connectivity. The algorithm does not necessarily identify all nodes which retain full
connectivity in the network as available in the harvested network. Since it is the overall system
performance that matters, not simply the number of nodes available for computation, Chong observes
that better overall performance is achieved when nodes with low bandwidth into the network
are eliminated from the set of nodes used for computation. Chong’s algorithm is summarized in
Figure 3.28.
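
The fault-propagation steps can be sketched in code as follows. This is an illustrative reading of the four steps, not Chong's implementation: the data layout (a list of stages, each mapping a router to its logical directions and their targets) is hypothetical, only whole-router faults are modeled, and the "unused" qualifier on output ports is ignored.

```python
def harvest(node_entry, stages, faulty):
    """Return the set of nodes kept after fault propagation.

    node_entry[n]: first-stage routers that node n injects into.
    stages: list from first to last stage; each maps router -> {direction: [targets]}.
            Targets are next-stage routers, or node ids in the final stage.
    faulty: routers that have failed completely.
    """
    faulty = set(faulty)
    # Final-stage routers that deliver to each node.
    node_exit = {n: set() for n in node_entry}
    for router, directions in stages[-1].items():
        for targets in directions.values():
            for n in targets:
                node_exit[n].add(router)
    # Steps 1-2: all nodes start live; drop the I/O-isolated ones.
    live = {n for n in node_entry
            if not set(node_entry[n]) <= faulty and not node_exit[n] <= faulty}
    # Step 3: mark blocked routers, working from the outputs back to the inputs.
    leads_to_live, blocked = {}, set()
    for depth in range(len(stages) - 1, -1, -1):
        is_last = depth == len(stages) - 1
        for router, directions in stages[depth].items():
            targets = [t for ts in directions.values() for t in ts]
            if is_last:
                leads_to_live[router] = any(t in live for t in targets)
            else:
                leads_to_live[router] = any(leads_to_live[t] for t in targets)
            if router in faulty:
                if leads_to_live[router]:
                    blocked.add(router)
            elif not is_last and any(all(t in blocked for t in ts)
                                     for ts in directions.values()):
                blocked.add(router)     # some direction has no unblocked successor
    # Step 4: drop nodes whose first-stage connections all lead to blocked routers.
    return {n for n in live if not set(node_entry[n]) <= blocked}


# Hypothetical two-stage, four-node network with one entry router per node.
STAGES = [
    {"A0": {0: ["B0"], 1: ["B2"]}, "A1": {0: ["B1"], 1: ["B3"]}},
    {"B0": {0: [0], 1: [1]}, "B1": {0: [0], 1: [1]},
     "B2": {0: [2], 1: [3]}, "B3": {0: [2], 1: [3]}},
]
NODE_ENTRY = {0: ["A0"], 1: ["A1"], 2: ["A0"], 3: ["A1"]}

if __name__ == "__main__":
    print(sorted(harvest(NODE_ENTRY, STAGES, set())))           # [0, 1, 2, 3]
    print(sorted(harvest(NODE_ENTRY, STAGES, {"B0"})))          # [1, 3]
    print(sorted(harvest(NODE_ENTRY, STAGES, {"B0", "B1"})))    # [2, 3]
```

In the single-fault case, nodes 0 and 2 are discarded even though they can still reach part of the machine, which is the deliberate sacrifice of weakly connected nodes described above.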

Figure 3.29 shows the harvest rate for a 5-stage, radix-4, dilation-2 (1024 node) network. Also 
shown is the degradation in application performance assuming that the application can be efficiently 
repartitioned to run on the surviving processors. 

3.5.6 Trees 

Fat-trees have the same basic multipath, multistage structure as the multistage networks described
so far in this section. It is easiest to think of each fat-tree network as two sub-networks.
One sub-network routes from the root of the tree down to the leaves. This portion looks almost
identical to the routing performed by the multistage networks that have been discussed. In particular,
this downward routing network performs the same recursive subdivision of possible destinations
at each successive routing stage. The other sub-network allows connections to be routed up to
the appropriate intermediate tree level and then cross over into the down routing sub-network.
In fact, we could think of the flat, multistage networks as a tree which had a degenerate up and
crossover sub-network. In these networks, the up network is simply a set of wires which connect all








(A)    (B)    [horizontal axis of both panels: Percent Network Failure]

Figure (A) shows the percentage of node loss under the criterion of fault-propagation. Figure (B)
compares the performance of the randomly wired multibutterfly with the randomized
maximal fanout network.


Figure 3.29: Fault-Propagation Node Loss and Performance for 1024-Node Systems (from [CK92]) 


network input connections directly into the root of the tree. It is the upward routing portion of the
tree networks which gives them their ability to exploit locality. Two nodes close to each other can
cross over low in the tree structure and avoid traversing a large number of routers or consuming
bandwidth near the root of the tree.

Fat-Trees 

Fat-trees are distinguished from arbitrary tree-based networks in that the interconnection bandwidth
increases towards the root of the tree. The internal tree connections closer to the root require
more bandwidth because they service a larger number of nodes below them. For instance, in a
binary fat-tree the root of the tree will see all traffic that is not constrained solely to either half of
the machine. The property that makes fat-tree structures most attractive is their universality property.
Leiserson shows that, when the rate of bandwidth growth in the fat-tree is chosen properly,
fat-trees can be volume universal. That is, a properly constructed volume O(V) fat-tree network
can simulate any volume O(V) network in polylogarithmic time [Lei85] [Lei89] [GL85].

The key observation in demonstrating the universality of various fat-tree structures is that the
physical world places constraints on the ratio between the volume of a region and the wire channel
capacity, and hence bandwidth, which can efficiently enter or leave that volume. The channel
capacity into a volume is limited by the surface area surrounding that volume. As we scale up
to larger systems and hence larger volumes, the surface area of a given volume, V, grows only
as O(V^(2/3)). To remain volume efficient, the channel capacity can only grow as Θ(V^(2/3)). If the
channel capacity grows faster than this, then the size of the system packaging is limited by the
channel capacity between regions rather than the volume of the system being packaged. As the
system becomes large, pieces of the system must be placed further apart due to the interconnection
bandwidth constraints. As a result, the universality property will not hold because the number of





processors per unit volume is decreasing as the system increases in size. If the channel capacity
grows slower than this, the universality property does not hold due to insufficient channel capacity
to support the potential message traffic. For binary fat-trees, Leiserson shows that channel capacity
should increase by a factor of ∛4 per stage toward the root of the tree in order to achieve volume universality.
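
The per-stage growth factor for a binary fat-tree can be recovered from the surface-area argument. As a worked step, assuming that each level toward the root encloses twice the volume of the level below it and that channel capacity tracks surface area:

```latex
c(V) = \Theta\left(V^{2/3}\right)
\quad\Longrightarrow\quad
\frac{c(2V)}{c(V)} = \frac{(2V)^{2/3}}{V^{2/3}} = 2^{2/3} = \sqrt[3]{4} \approx 1.59
```

Doubling the capacity at every level would therefore outgrow the available surface area, while leaving it constant would starve the upper levels of bandwidth.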

Fat-trees also have considerable flexibility. When other pragmatic issues dictate a structure 
that allows more channel capacity, and consequently more bandwidth, at higher tree levels than 
is appropriate for volume-universality, the basic fat-tree structure can accommodate the increased 
interstage capacity. These additional channels will allow additional fault tolerance and lower the
network’s contention latency, 7.

Building Fat-Trees 

We can build fat-tree networks with the same fixed-size, dilated routers which we have used to 
construct flat, multistage networks. The use of such routers in the down sub-network is obvious 
since the down sub-network performs the same sorting function as in the flat networks. Here, 
the router radix defines the arity of the fat-tree. The up routing sub-network needs to expand the 
possible destinations so that a given route may make use of a large portion of the bandwidth at 
some higher tree stage. The up routing sub-network also needs to provide switching which allows 
periodic crossover to the down routing network. At the same time, the bandwidth between tree 
levels needs to be controlled to match the application requirements as described in the previous 
section. Just as with the flat-multistage networks, the endpoint connections are weak links and one 
generally wants to organize networks with multiple network connections per endpoint. Similarly, 
the issues of wiring the internal stages for fanout apply equally well here. 

As an example, consider building a fat-tree using radix-4, dilation-2 routing components. The
down sub-network is a quaternary tree. In the up sub-network, we use the routing components to
switch between upward routing and crossover connections into the down sub-network. We can take
advantage of the radix-4 switching provided by the routing component to route to several crossover
connections at a single switching stage. As a result, we effectively create short-cut paths in the up
routing tree. Figure 3.30 shows how a radix-4 up router can switch to three successive tree stages
and provide upward connection in the tree. Since each up router in the up sub-tree services three
down-tree stages, the route to the root is only (1/3) log_4 N long. Figures 3.31 and 3.32 show the
logical connectivity for the up and down sub-trees using the short-cut crossover scheme shown in
Figure 3.30.

3.5.7 Hybrid Fat-Tree Networks 

Fat-trees allow us to exploit a considerable amount of locality at the expense of lengthening the 
paths between some processors. Flat, multistage networks fall at the opposite extreme of the locality 
spectrum where all nodes are uniformly close or distant. Another interesting structure to consider is
a hybrid fat-tree. A hybrid fat-tree is a compromise between the close uniform connections in the 
flat, multistage network and the locality and scalability of the fat-tree network. In a hybrid fat-tree, 
the main tree structure is constructed exactly as described in the previous section. However, the 
leaves of the hybrid fat-tree are themselves small multibutterfly style networks instead of individual 
processing nodes. With small multibutterfly networks forming the leaves of the hybrid fat-tree, 





(To next Up router) 



(From previous Up router) 
(or processing nodes) 


(To other down routers) 
(or processing nodes) 


Shown above is a cross-sectional view of a fat-tree network showing a switching node in the 
up routing sub-tree and a down router in each of the three successive tree stages to which 
this up router can crossover. As shown, each router is a radix-4 routing component. Only 
a single output is shown in each logical direction for simplicity. With dilated routers, each 
dilated connection would be connected to different routers in the corresponding destination 
direction for fault tolerance. 


Figure 3.30: Cross-Sectional View of Up Routing Tree and Crossover 


small to moderate clusters of processors can efficiently work closely together while still retaining 
reasonable ability to communicate with the rest of the network. 

The flat, leaf portion of the network is composed of several stages of multibutterfly style 
switching. Each stage switches among r logical directions. The first stage is unique in that only 
(r — 1) of the r logical directions through the first stage route to routers in the next stage of the 
multibutterfly. The final logical direction through the first routing stage connects to the fat-tree 
network. The remaining stages in the leaf network perform routing purely within the leaf cluster. 
To allow connections into the leaf cluster from the fat-tree portion of the network, one r-th of the 
inputs to the first routing stage come from the fat-tree network rather than from the leaf cluster 
processing nodes. Figure 3.33 shows a diagram of such a leaf cluster. This hybrid structure was 
introduced in [DeH90] and is developed in more detail there.

3.6 Flexibility 

In Section 2.8, we raised some concerns about how well a network topology can be adapted
to solve particular applications. Having reviewed the properties of these networks, we can answer
many of the questions raised.






Figure 3.31: Connections in Down Routing Stages (left) 

Figure 3.32: Up Routing Stage Connections with Lateral Crossovers (right) 



Figure 3.33: Multibutterfly Style Cluster at Leaves of Fat-Tree 


• How do we provide additional bandwidth for each node at a given level of semiconductor
and packaging technology?

If we assume that the semiconductor technology limits the interconnect speed, then we are
trying to increase the bandwidth in an architectural way. With both flat multipath networks
and multibutterflies, we can easily increase the bandwidth into a node by increasing the
number of connections to and from the network (i.e. ni and no). This also has the side
effect of increasing the network fault tolerance.








• How do we get more/less fault tolerance for applications which have a higher/lower premium
for faults?

The simple answer here is to increase the number of connections to and from the network,
since this is the biggest limitation to fault tolerance. Using higher dilation routers will provide
more potential for expansion and hence better fault tolerance. Hybrid schemes which use
extra stages in a dilated network will also serve to increase the number of paths and hence
the fault tolerance of the network.

• How do we build larger (smaller) machines? 

The scalability of the schemes presented here allows the same basic architecture to be used in
the construction of large or small machines. For very large machines, we saw that fat-trees or 
hybrid fat-trees are the best choice. For smaller machines, we saw that multistage networks 
may provide better performance. In between, the details of the technologies involved as well 
as other system requirements will determine where the crossover lies. 

• How can we decrease latency? at what costs? 

We have control over the latency in several forms. The switching latency (T_s) is directly 
controlled by the router radix, r. Increasing the radix of the router will lower the number 
of stages which must be traversed and tend to decrease latency. However, the router radix 
is limited by the pin limitations of the routing component. Increasing the radix will either 
require an increase in die size and package pin count (and hence cost), or a decrease in 
dilation or data channel width. Decreasing dilation will tend to reduce fault tolerance and 
increase congestion. Decreasing the data channel width decreases the bandwidth and thus 
increases both congestion and the message transmission time (T_transmit). By increasing the 
channel width, we can decrease transmission time; again, this will either increase die size 
and cost, or require a decrease in radix or dilation. Finally, we can decrease congestion 
by increasing router dilation or increasing the aggregate network bandwidth. Increasing the 
dilation, again, must be traded off against radix, channel width, and cost. Increasing the 
number of inputs and outputs to the network will increase the aggregate bandwidth of the 
network at the cost of more network resources. 
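
The radix, dilation, and channel-width trade-offs above can be explored with a rough numerical 
model. The following is only an illustrative sketch, not a model taken from this report: it 
assumes that a network of N endpoints built from radix-r routers needs the smallest number of 
stages s with r^s >= N, that each stage costs a fixed number of clock cycles, that a message of 
L bits on a w-bit channel needs ceil(L/w) cycles, and that data pin count scales as (i + r*d)*w. 
All function and parameter names are hypothetical. 

```python
import math

def stages(n_endpoints: int, radix: int) -> int:
    """Smallest number of stages s such that radix**s >= n_endpoints."""
    s, reach = 0, 1
    while reach < n_endpoints:
        reach *= radix
        s += 1
    return s

def latency_cycles(n_endpoints: int, radix: int, width_bits: int,
                   msg_bits: int, cycles_per_stage: int = 1) -> int:
    """Assumed model: switching cycles plus transmission cycles, ignoring congestion."""
    switching = stages(n_endpoints, radix) * cycles_per_stage
    transmit = math.ceil(msg_bits / width_bits)
    return switching + transmit

def data_pins(forward_ports: int, radix: int, dilation: int, width_bits: int) -> int:
    """Assumed pin budget: every forward and backward port carries width_bits data pins."""
    return (forward_ports + radix * dilation) * width_bits

# Compare two hypothetical router designs for a 1024-endpoint network.
for radix, dilation, width in [(4, 2, 8), (8, 1, 8)]:
    print(radix, dilation, width,
          latency_cycles(1024, radix, width, msg_bits=160),
          data_pins(8, radix, dilation, width))
```

Under these assumptions, raising the radix at a fixed pin budget removes a stage of switching 
latency but gives up dilation, which is the trade-off described in the text. 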

3.7 Summary 

In this section, we have examined network topologies suitable for implementing robust, low- 
latency interconnect for large-scale computing. We saw that express-cubes and fat-trees have the 
best asymptotic characteristics in terms of latency and growth. We also saw how the multipath 
nature of these networks allows the potential for tolerating faults within the networks. For many 
networks, we see that architectures which tolerate network faults do not necessarily require additional 
network latency. The only increase in network latency results from the lower bandwidth available in 
the faulty network. We examined detailed issues relevant to wiring multistage networks. We found 
that good performance results from wiring the network to avoid congestion and that randomized 
techniques provide the best strategy currently known for achieving such network wirings. 





3.8 Areas to Explore 

We have, by no means, explored all the issues associated with selecting the optimal network 
for every application. The following is a list of a few interesting areas of pursuit: 

1. It is hard to provide a final head-to-head latency comparison between networks without a good 
quantification of the effects of congestion in various networks. As mentioned in Section 2.4, 
this is particularly difficult because the effects of congestion are highly dependent upon the 
network usage pattern needed by the application and the detailed network topology. A good 
quantification of congestion applicable across a wide range of networks and loading patterns 
would go a long way toward helping engineers design and evaluate routing networks. 

2. In Section 3.5.4 we demonstrate that a class of extra-stage networks has the same fault-tolerant 
properties as dilated networks. These networks will generally have lower performance due 
to the necessity to make detailed routing decisions at the node rather than inside the network 
where the freedom can be used to minimize blocking. It would be worthwhile to quantify 
the magnitude of the performance improvement offered by the dilated routing components. 

3. Express-cubes have the same asymptotic network characteristics as fat-trees. We avoid 
detailed consideration of these networks at this point due to the difficulties associated with 
efficiently routing on such networks in the presence of faults. In the next chapter, we will 
show how to route effectively with faults for fat-tree and multistage networks. It would be 
interesting to see comparable routing solutions for express-cubes. 





4. Routing Protocol 


In the previous chapter, we saw how to construct multipath networks. The organization of these 
networks offers considerable potential for low-latency communication and fault-tolerant operation. 
To make use of this potential, we need a routing scheme which is capable of exploiting the multiple 
paths with low latency. In this chapter, we develop a suitable routing scheme and show how it 
meets these needs. 

4.1 Problem Statement 

As introduced in Chapter 1, we need a routing scheme which provides: 

1. Low-overhead routing 

2. Protocol Flexibility 

3. Distributed routing 

4. Dynamic fault tolerance 

5. Fault identification and localization with minimal overhead 

4.1.1 Low-overhead Routing 

Any overhead associated with sending a message will increase end-to-end message latency. 
There are two primary forms of overhead which we wish to minimize: 

1. Overhead data 

2. Overhead processing 

Overhead data includes message headers and trailers added to the message. Overhead data will 
diminish the available network bandwidth for conveying actual message data. Overhead processing 
includes the processing which must be done at each endpoint to interact with the network (e.g. T_p) 
and the processing each router must perform to properly process each data stream (e.g. t_switch). 
Endpoint overhead processing includes: 

1. processing necessary to prepare data for presentation to the network 

2. processing necessary to use data arriving from the network 

3. processing necessary to control network operations 

We want a protocol that satisfies the various routing requirements with minimal overhead in terms 
of both processing time and transmitted data. 






4.1.2 Flexibility 

In the interest of providing general, reusable routing solutions, we seek a minimal protocol for 
reliable end-to-end message transport. Specific applications will need to use the network in many 
different ways. To allow as large a class of applications as possible the opportunity to use the 
network efficiently, the restrictions built into the underlying routing protocol should be minimized. 

4.1.3 Distributed Routing 

In the interest of fault tolerance, scalability, and high-speed operation, we want a distributed, 
self-routing protocol. A centralized arbiter would provide a potential single point of failure 
and have poor scalability characteristics. Rather, we need a routing scheme which can allocate 
routing resources and make connections efficiently in practice using only localized information. A 
distributed routing scheme operating on local information has the following beneficial properties: 

• faults only affect a small, localized area 

• routing decisions are simple and hence can be made quickly. 

4.1.4 Dynamic Fault Tolerance 

To provide continuous, reliable operation, the routing scheme must be capable of handling faults 
which arise at any point in time during operation. As introduced in Section 2.1.1, transient faults 
occur much more frequently than permanent faults. Additionally, for sufficiently long computations 
on any large machine, one or more components are likely to become faulty during the computation 
(see, e.g., the example presented in Section 2.5). 

4.1.5 Fault Identification 

Although a routing protocol which can properly handle dynamic faults can tolerate unidentified 
faults in the system, the performance of the routing protocol can be further improved by identifying 
the static faults and reconfiguring the network to avoid them. Fault identification also makes it 
possible to determine the extent of the faults in the system. This allows us to determine how close 
the system is to becoming inoperable. To the extent possible, the routing scheme should facilitate 
fault identification with low overhead. The faster that faults can be identified and the system 
reconfigured, the less impact the faults will have on network performance. 

4.2 Protocol Overview 

We have designed the Metro Routing Protocol (MRP) to address the issues raised in Section 4.1. 
MRP is a synchronous protocol for circuit-switched, pipelined routing of word-wide data 
through multipath, multistage networks constructed from crossbar routing components. MRP uses 
circuit switching to minimize the overhead associated with routing connections while facilitating 
tight time-bounded, end-to-end, source-responsible message delivery. MRP is composed of two 
parts: a router-to-router communication protocol, MRP-ROUTER, and a source-responsible node 
protocol, MRP-ENDPOINT. 





In operation, an endpoint will feed a data stream of an arbitrary number of words into the 
network at the rate of one word per clock cycle. The first few data words are treated as a 
routing specification and are used for path selection. Subsequent words are pipelined through 
the connection, if any, opened in response to the leading words. When the data stream ends, the 
endpoint may signal a request for the open connection to be reversed or dropped. When each router 
receives a reversal request from the sender, the router returns status and checksum information 
about the open connection to the source node. Once all routers in the path are reversed, data may 
flow back from the destination to the source. The connection may be reversed as many times as the 
source and destination desire before being closed. End-to-end checksums and acknowledgments 
ensure that data arrives intact at the destination endpoint. When a connection is blocked due to 
contention or a data stream is corrupted, the source endpoint retries the connection. 

4.3 MRP in the Context of the ISO OSI Reference Model 

MRP fits into a layered protocol scheme, such as the ISO OSI Reference Model [DZ83], at the 
data-link layer (see Figure 4.1). That is, MRP itself is independent of the underlying physical layer 
which takes care of raw bit transmissions. MRP is thus independent of the electrical and mechanical 
aspects of the interconnection. The protocol is applicable both in situations where the transit time 
between routers is less than the clock period and in situations where multiple data bits are pipelined 
over long wires (See Section 3.2). MRP provides mechanisms for controlling the transmission of 
data packets and the direction of transmission over interconnection lines. It also provides sufficient 
information back to the source endpoint so the source can determine when a transmission succeeds 
and when retransmission is necessary. By leaving the retransmission of corrupted packets to the 
source, MRP allows the source endpoint to dictate the retransmission policy. As such, both the 
MRP-ROUTER and MRP-ENDPOINT are required to completely fulfill the role of the data-link layer. 
Since MRP provides dynamic self-routing, the protocol layer identified as the network layer by the 
ISO OSI model is also provided by MRP. 

MRP itself is connection oriented, though there is no need for higher-level protocols to be 
connection oriented. Together, MRP-ROUTER and MRP-ENDPOINT provide a reliable, byte-stream 
connection from end-to-end through the routing network. 

4.4 Terminology 

Recall from Chapter 3 that a crossbar has a set of inputs and a set of outputs and can connect 
any of the inputs to any of the outputs with the restriction that only one input can be connected to 
each output at any point in time. A dilated crossbar has groups of outputs which are considered 
equivalent. We refer to the number of outputs which are equivalent in a particular logical direction 
as the crossbar’s dilation, d. We refer to the number of logically distinct outputs which the crossbar 
can switch among as its radix, r. 

A circuit-switched routing component establishes connections between its input and output 
ports and forwards the data between inputs and outputs in a deterministic amount of time. Notably, 
there is no storage of the transmitted data inside the routing component. In a network of circuit- 
switched routing components, a path from the source to the destination is locked down during 
the connection; the resources along the established path are not available for other connections 
during the time the connection is established. In a network of pipelined, circuit-switched routing 
components, all the routing components run synchronously from a central clock and data takes a 
deterministic number of clock cycles to pass through each routing component. 


MRP fits into the ISO OSI Reference Model at the data-link layer. The routers in a multipath 
network use MRP-ROUTER to transfer data through the network. Each endpoint uses MRP- 
ENDPOINT to facilitate end-to-end data transfers. 


Figure 4.1: METRO Routing Protocol in the context of the ISO OSI Reference Model 


A crossbar is said to be self-routing if it can establish connections through itself based on 
signalling on its input channels. That is, rather than some external entity setting the crosspoint 
configuration, the router configures itself in response to requests which arrive via the input channels. 
A router is said to handle dynamic message traffic when it can open and close connections as 
messages arrive independently from one another at the input ports. 

When connections are requested through a router, there is no guarantee that the connections 
can be made. As long as the dilation of the router is smaller than the number of input channels into 
a router (i.e. d < i), it is possible that more connections will want to connect in a given logical 
direction than there are logically equivalent outputs. When this happens, some of the connections 
must be denied. When a connection request is rejected for this reason, it is said to be blocked. The 
data from a blocked connection is discarded and the source is informed that the connection was not 
established. 


The basic router has i forward ports and o = r · d backward ports. Any forward port can be 
connected through the crosspoint array to any backward port. The arrows indicate the initial 
direction of data flow. 


Figure 4.2: Basic Router Configuration 


Once a connection is established through a crossbar, it can be turned. That is, the direction of 
data transmission can be reversed so that data flows from the original destination to the original 
source. This capability is useful for providing rapid replies between two nodes and is important 
in effecting reliable communications. MRP provides half-duplex, bidirectional data transmission 
since it can send data in both directions, but only in one direction at a time. When data is flowing 
between two routers, we call the router sending data the upstream router and the router receiving 
data the downstream router. 

Since connections can be turned around and data may flow in either direction through the 
crossbar router, it is confusing to distinguish input and output ports since any port can serve as 
either an input or an output. Instead, we will consider a set of forward ports and a set of backward 
ports. A forward port initiates a route and is initially an input port while a backward port is initially 
an output port. The basic topology for a crossbar router assumed throughout this chapter is shown 
in Figure 4.2. 












Signal/Datum    Control Bit    Control Field 
------------    -----------    -------------------------- 
IDLE/DROP       0              all zeros 
ROUTE           1              direction specification 
TURN            0              all ones 
STATUS          1              port specification 
CHECKSUM        1              checksum bits 
DATA-IDLE       0              distinguished hold pattern 
DATA            1              arbitrary 


The words sent over the network links can be classified as data words and signalling words. 
The use of a single control bit which is separate from the transmitted data bits allows out-of- 
band signalling to control the connection state. This table shows how control signals and 
data are encoded. The control field is a designated log2(max(r, d))-bit portion of the data 
word. 


Table 4.1: Control Word Encodings 
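
The encodings in Table 4.1 can be made concrete with a small sketch. This is an illustration 
under stated assumptions rather than the router's actual logic: the names are hypothetical, the 
value of the "distinguished hold pattern" is not given in the table and is chosen arbitrarily 
here, and only words whose control bit is 0 are classified, since words with the control bit set 
(ROUTE, STATUS, CHECKSUM, DATA) are told apart by connection state rather than by encoding. 

```python
def control_field_width(radix: int, dilation: int) -> int:
    """Bits needed to name one of max(radix, dilation) choices, i.e. ceil(log2(...))."""
    return (max(radix, dilation) - 1).bit_length()

# Hypothetical value: the table only says the hold pattern is "distinguished".
HOLD_PATTERN = 0b01

def classify_signal(field: int, width: int, hold_pattern: int = HOLD_PATTERN) -> str:
    """Classify a word whose control bit is 0 by the contents of its control field."""
    all_ones = (1 << width) - 1
    if field == 0:
        return "IDLE/DROP"
    if field == all_ones:
        return "TURN"
    if field == hold_pattern:
        return "DATA-IDLE"
    raise ValueError("not a valid signalling word")

width = control_field_width(radix=4, dilation=2)
print(width, classify_signal(0b11, width))
```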


4.5 Basic Router Protocol 

The behavior of MRP-ROUTER is based on the dialog between each backward port of each router 
and its companion forward port in the following stage of routers. In this section, we describe the 
core behavior of the router signalling protocol from the point of view of a single pair of routing 
components. 

4.5.1 Signalling 

Routing control signalling is performed over data transmission channels. Using simple state 
machines and one control bit, this signalling can occur out of band from the data. That is, the 
control signals are encoded outside of the space of data encodings. Out of band signalling allows 
the protocol to pass arbitrary data. Table 4.1 shows the encoding of various signals. The control field 
is a designated portion of the data word. Due to encoding requirements, it is at least log2(max(r, d)) 
bits long. The remainder of this section explains how these control signals are used to effect routing 
control. 

4.5.2 Connection States 

The states of a forward-backward port pair can be described by a simple finite state machine. 
Figure 4.3 shows a minimal version of this state machine for the purpose of discussion. Each 
transition is labeled as <event>/<result><dir>, where <event> is a logical expression, usually 
including the reception of a particular kind of control word, <result> is an output resulting from the 
reception, and <dir> is an arrow indicating the direction in which the <result> is sent. For instance, 
the arrow from swallow to forward means that when a DATA word is received and the resulting 
output direction specified is not blocked, forward the DATA out the allocated backward port in the 
forward direction and change to the forward state. 



As a connection is opened, released, reversed, and used in the network, [...] 
within the network go through a series of connection states as shown [...]. Transitions 
are initiated by a control word and qualified by the local state 
of the router. Each transition is labeled <event>/<result><dir>: <event> describes 
the control word received, <result> is the output resulting 
from the <event>, and <dir> is an arrow indicating the direction in which the <result> is sent. 


Figure 4.3: MRP-ROUTER Connection States 


4.5.3 Router Behavior 

Idle port 

When a connection between routers is not [...], it is in the idle state. While a connection is in the 
idle state, the backward port [...] IDLE words [...] to the corresponding forward port 
in the next stage of the network. A forward port interprets the reception of IDLE [...] as an indication 
that it should remain in an idle state and hence should not attempt [...]. 




Route 


To route a connection through a router, a ROUTE word is fed into a router forward port. The 
forward port recognizes the transition of the control bit from the zero [...] 
the connection was idle, to a one. The router then uses the control field [...] 
routing direction. The router forwards the ROUTE word through a backward port in the [...] 
direction, if one is available, and locks down [...] DATA words [...] 
passed in the same direction. If no output is available, the connection [...]. 

Data 

All words with control bits set which follow the ROUTE word are forwarded [...] 
through the allocated backward port of the routing component to the [...] 
in the network. 

Data Completion 

When all of the data words in a message have been passed into a router, [...] 
options. The sender can either drop the connection [...] or turn the 
direction of the connection around for a reply. 

To drop the connection, the DROP word is [...]. A DROP word causes 
the router to close down an open connection and free up the output [...] 
port is freed [...] DROP word, it forwards the DROP along to the next router then returns to [...]. 

To turn a connection around, the TURN word is [...]. When the router receives a 
TURN word, it forwards the turn out the allocated output port, if there is one, [...] 
status words. Figure 4.3 shows a version of the protocol which returns [...] 
fill the pipeline delay associated with informing the subsequent router [...] 
and getting backward data from the subsequent router. The filler words [...] 
connection is sending data in the forward direction when the TURN is received [...]. 

In the forward direction, the [...] STATUS word and [...] CHECKSUM word. If 
the connection was not blocked during the routing cycle, after sending [...] 
be receiving data in the reverse direction and will forward this data [...]. If the connection was 
blocked, the router forwards a DROP following the CHECKSUM and returns to the idle state. 

In the backward direction, the router [...] sending the 
reverse data. In order for a port to be in the backward direction, there [...] 
the router so there will always be data to propagate following a backward [...]. 

Checksum and Status Information 

The STATUS and CHECKSUM words form a series of word-wide values which serve [...] 
the source node of the integrity of the connection [...] 
through the network. A STATUS word informs the source about which of the [...] 
equivalent backward ports, if any, through which the connection is routed [...] 
arrives uncorrupted at the source endpoint, it allows the source to [...] 
which the connection was actually routed. When a connection is [...], this 
information serves to pinpoint where the connection [...]. The remaining bits [...] 
STATUS word, together with the CHECKSUM word or words, can be used to transmit [...] 
longitudinal checksum back to the source. This checksum is generated [...] 
the connection since the [...] ROUTE word itself. When data is corrupted due 
to faults in the network, the checksum provides the source endpoint [...] 
identify the most likely corruption source. 

4.5.4 Making Connections 

When a connection is opened through a router, there may or may not [...] 
desired logical output direction. If there is no available output, the router [...] 
bits associated with the data stream. When the connection is later turned, [...] 
word returned by the routing node informs the source that the message [...]. 
When exactly one output in the desired direction is available, the router [...] 
through that output. When multiple paths are available, the router switches [...] 
appropriate backward port selected randomly from those available. 

This random path selection is the key to making the protocol robust [...] 
while avoiding the need for centralized information about the network [...] 
protocol simple. When faults develop in the network, the source detects [...] 
or damaged connection by the acknowledgment from the destination [...] 
resend the data. Since the routing components select randomly among [...] 
stage, it is highly likely that the retry connection will take an alternate [...], 
avoiding the newly exposed fault. Source-responsible retry coupled with [...] 
selection guarantees that the source can eventually find a fault-free [...], 
provided one exists. The random selection also frees the source from [...] 
of the redundant paths provided by dilated components in the network. [...] 
equivalent available outputs is an extremely simple selection criterion [...] 
can be implemented with little area [...] requires [...] 
state information not already contained on the individual routing component. 
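
The selection rule described in this section (block when no equivalent output is free, otherwise 
pick at random among the free ones) can be sketched as follows. This is a minimal illustration, 
not the router implementation: the function name, the contiguous numbering of the d backward 
ports in each logical direction, and the use of a software random number generator are all 
assumptions. 

```python
import random

def select_backward_port(direction: int, dilation: int, busy: set, rng: random.Random):
    """Pick a free backward port in the requested logical direction.

    Assumes ports direction*dilation .. direction*dilation + dilation - 1 are the
    logically equivalent outputs. Returns None when the connection is blocked.
    """
    first = direction * dilation
    free = [p for p in range(first, first + dilation) if p not in busy]
    if not free:
        return None          # no equivalent output available: connection is blocked
    return rng.choice(free)  # random choice among equivalent available outputs

rng = random.Random(1993)
print(select_backward_port(direction=1, dilation=2, busy={2}, rng=rng))
```

Because each retry draws a fresh random choice at every stage, a retried connection is likely 
to follow a different path than the failed attempt, which is the property the text relies on. 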

4.6 Network Routing 

Each router in a path through the network needs to see a different [...] 
we require the routing specification to be [...] ROUTE word [...] 
implementation of the protocol, [...] 
Between routing stages, the bits of the datapath can be permuted so that [...] 
see distinct control fields. This bit reordering allows a single routing [...] 
through several routers. However, if the network is sufficiently large, [...] 
routing word will eventually be exhausted before the full route through [...]. 
To deal with this, [...] allows routing switches to be configured to ignore the [...] 
in an incoming message [...] configuration bit. This option allows networks [...] 
be arbitrarily large. Every time [...] ROUTE word [...] exhausted, the [...] 
word can be discarded allowing routing to continue with the fresh [...] 
word. In this manner, the DATA word in the message following the ROUTE [...] original [...] 
promoted to ROUTE word after the original is exhausted. 
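
The idea of consuming one routing word until its bits are exhausted and then continuing with the 
next word can be illustrated in software. The sketch below is an assumption-laden model, not the 
hardware mechanism: in the network the datapath bits are permuted by wiring between stages, 
whereas here a shift stands in for that permutation, and the names, field ordering, and word 
width are hypothetical. 

```python
def pack_route(directions, bits_per_stage: int, word_bits: int):
    """Pack per-stage direction fields into as many routing words as needed."""
    per_word = word_bits // bits_per_stage
    words = []
    for start in range(0, len(directions), per_word):
        word = 0
        for slot, direction in enumerate(directions[start:start + per_word]):
            word |= direction << (slot * bits_per_stage)
        words.append(word)
    return words

def walk_route(words, n_stages: int, bits_per_stage: int, word_bits: int):
    """Model each stage reading its field; drop the head word once it is used up."""
    per_word = word_bits // bits_per_stage
    mask = (1 << bits_per_stage) - 1
    pending = list(words)
    used = 0
    taken = []
    for _ in range(n_stages):
        taken.append((pending[0] >> (used * bits_per_stage)) & mask)
        used += 1
        if used == per_word:
            pending.pop(0)   # exhausted word is discarded; the next word takes over
            used = 0
    return taken

route = [3, 0, 2, 1, 1]
words = pack_route(route, bits_per_stage=2, word_bits=8)
print(len(words), walk_route(words, len(route), 2, 8) == route)
```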

4.7 Basic Endpoint Protocol 

Network endpoints use MRP-ENDPOINT to guarantee the delivery of at least one uncorrupted 
copy of each message stream to the [...] destination. A source [...] 
of a connection [...] network is called a network input while a receiving channel [...] network is called a network 
output. At the most primitive level, each network input behaves like a router backward port and 
each network output behaves like a router forward port. The control [...] 
output, however, is more involved than the simple data stream handling [...] 
backward ports as described in Section 4.5. 

4.7.1 Initiating a Connection 

When a node wishes to initiate a connection over the network, the network input [...] 
header and message checksum to the data and send it into the network. [...] 
then followed with a DROP or TURN to indicate the disposition of the link following [...] 
transmission. The initial message thus looks like: 

(ROUTE)* o (DATA)* o (DATA_checksum)* o TURN 

<OR> 

(ROUTE)* o (DATA)* o (DATA_checksum)* o DROP 

The ROUTE word or words specifies a path to the desired destination. [...] 
how the datapath and routers can be configured so that unique routing [...] 
routing component in the path between the source and the destination [...] 
constructed accordingly. 
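
The two initial message forms can be sketched as a small stream builder. This is an illustration 
only: words are modeled as (kind, value) pairs, and the function and parameter names are 
hypothetical rather than part of MRP-ENDPOINT. 

```python
def build_message(route_words, data_words, checksum_words, expect_reply: bool):
    """Assemble ROUTE* DATA* checksum* followed by TURN (reply wanted) or DROP."""
    stream = [("ROUTE", w) for w in route_words]
    stream += [("DATA", w) for w in data_words]
    stream += [("DATA", w) for w in checksum_words]   # checksum travels as DATA words
    stream.append(("TURN", None) if expect_reply else ("DROP", None))
    return stream

message = build_message([0x63], [0x10, 0x20], [0x5A], expect_reply=True)
print([kind for kind, _ in message])
```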

In general, a node has multiple network inputs. In the same way that routers choose 
among the available logically equivalent outputs, the node should choose [...] 
available network inputs. The benefits in terms of dynamic fault avoidance [...] 
the routers as discussed in Section 4.5.4. When there are different paths [...] 
specification which reach the same destination, [...] the case in an extra-stage 
network (Section 3.3.4), the node should choose randomly among the available paths [...] 
the network. This random selection avoids worst-case congestion [...] 
the network and gives the routing algorithm the property that it can [...] 
network. In these extra-stage cases, we are simply moving the random [...] 
from inside the network to the originating endpoint. 

Each message should be guarded with a checksum [...] message [...] 
endpoint can identify when a message has been corrupted. The length [...] 
chosen so that the probability of a corrupted message having a good checksum [...] 
for the intended application. The checksum should be constructed [...] 
mistakenly delivered to an incorrect destination will not be [...] that [...]. 
One way to ensure this is to include the destination node number in the [...]; 
another would be to seed the checksum as if the first portion of the data [...] 
but not actually transmit the node number. It is not, in general, possible [...] 
in the checksum and use them in place of the node number for assuring [...] 
the correct destination. With extra-stage networks or tree networks, [...] 
destination with many different route-path specifications. In such cases, [...] 
not unique to each destination and some of those routing words will have been [...] 
the message in the network (see Section 4.6) before [...] the destination. 
If the stripped routing words were used in calculating the checksum, [...] 
no way of knowing what the stripped routing words were and hence [...] 
computation. 
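
The option of seeding the checksum with the destination node number, without transmitting that 
number, can be sketched as below. CRC-32 from the Python standard library is used purely as a 
stand-in; the report does not specify a checksum function here, and the function names and the 
4-byte node-number encoding are assumptions. 

```python
import zlib

def seeded_checksum(dest_node: int, payload: bytes) -> int:
    """Checksum computed as if the node number preceded the payload."""
    seed = zlib.crc32(dest_node.to_bytes(4, "big"))
    return zlib.crc32(payload, seed)

def accept(own_node: int, payload: bytes, checksum: int) -> bool:
    """A receiver recomputes the checksum seeded with its own node number."""
    return seeded_checksum(own_node, payload) == checksum

payload = b"remote write"
check = seeded_checksum(7, payload)
print(accept(7, payload, check), accept(9, payload, check))
```

With this construction a message that arrives intact at the wrong node fails verification 
there, since that node seeds the computation with a different number. 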

If the sending node needs to guarantee that the message was actually [...] 
node, it must turn the network around after sending the data rather [...]. 
Unless the node turns the network and gets a reply from the destination, [...] 
endpoint will not know what happened to the message inside the network. [...] 
responsible for message retransmission in the case of network corruption, [...] 
the message for retransmission until an acknowledgment is received from the destination. 

4.7.2 Return Data from Network 

After sending a TURN into the network, the source endpoint receives status and checksum 
information from each router in the path [...]. For the simplified router 
protocol shown in Figure 4.3 the replies will look like: 

(STATUS o CHECKSUM)^s o DROP 
<OR> 

(STATUS o CHECKSUM)^N o (DATA)* o (DATA_checksum)* o DROP 

<OR> 

(STATUS o CHECKSUM)^N o (DATA)* o (DATA_checksum)* o TURN 

Here N is the number of stages in the network and s is the number of stages into the network the 
connection was routed before [...]. In the case where a connection is blocked, 
the source will only receive this status information up to and including [...] 
blocking occurred. As noted in Section 4.5.3, the checksum and status [...] 
source endpoint with information which allows it to localize the source [...]. 
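
A source endpoint's handling of these reply forms can be sketched as a small parser. This is an 
illustrative model, not the MRP-ENDPOINT logic: reply words are represented as (kind, value) 
pairs, and a STATUS value of None is assumed to mean that the router allocated no backward 
port, i.e. the connection was blocked at that stage. 

```python
def parse_reply(words):
    """Split a reply into per-stage ports, the blocked stage (if any), and the tail.

    Returns (ports, blocked_stage, tail); blocked_stage is a 0-based stage index,
    or None when every reporting router allocated a backward port.
    """
    ports = []
    i = 0
    while (i + 1 < len(words)
           and words[i][0] == "STATUS" and words[i + 1][0] == "CHECKSUM"):
        port = words[i][1]
        if port is None:                     # this router reports no allocated port
            return ports, len(ports), words[i + 2:]
        ports.append(port)
        i += 2
    return ports, None, words[i:]

blocked = [("STATUS", 1), ("CHECKSUM", 0xAB),
           ("STATUS", None), ("CHECKSUM", 0xCD), ("DROP", None)]
print(parse_reply(blocked))
```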

It is important to note that the checksums coming back from the [...] 
determine whether or not the destination endpoint [...] the data. Since a 
dynamic fault may arise at any point in time, a fault may, for example, occur [...] 
the message data was sent past the router but before the router [...]. 
[...] case, a corrupted checksum could seem to appear from a router [...] 
corruption. For this reason, the checksums from the routers serve only [...] 
source endpoint must use information from the destination endpoint [...] 
or not the destination received the data uncorrupted. 

When a connection is completed through the network [...] 
node, the destination has the opportunity to reply. At the very least, this [...] 
data stream arrived uncorrupted. Depending on the application, the [...] 
send reply data along with this acknowledgment. When the destination [...] 
data should also be guarded with a checksum to protect against dynamic [...]. 
When the destination simply acknowledges [...], the acknowledgment encoding 
should be chosen so that there is [...] acknowledgment can 
be corrupted into a positive acknowledgment. After its reply, the destination may [...] 
down the network connection or turn the connection around for further [...]. 

4.7.3 Retransmission 

We may surmise that a connection has failed to transfer data successfully [...] 
following occur: 

1. The path is blocked due to resource contention in the network 

2. The destination node indicates that data was corrupted upon arrival 

3. The return data stream does not adhere to protocol expectations 

In any of these events, the source must retransmit the data if it wishes [...] 
of uncorrupted data. The first event may occur when the network is [...] 
options indicate that there is a fault in the network. Blocking can also [...] 
kinds of network failure. The fault in the network could be transient [...] 
permanent fault. Since the endpoint does not know which kind of [...] 
single fault occurrence is not conclusive evidence that a particular [...]. 
Consequently, the node endpoint may wish to save away the reply data [...] 
fault analysis. 

When retrying the transmission, the source node has some freedom [...]. 
The source may choose to retry the same message or a message to a [...] 
may choose to retry immediately or after a wait period. Which technique [...] 
the requirements of the application. If the application expects the messages [...] 
delivered to their destinations in the order they were generated, the node [...] 
choose a different message. Since the path on a retransmission may be [...] 
just taken, it may be beneficial to immediately retry the failed connection [...] 
was due to blocking. While much work has been done on backoff and [...] 
systems (e.g. [HL...]), retransmission policies for this class of networks remain [...] 
research. 

If the network continues to retain complete connectivity between [...] 
the source will eventually be able to deliver its message to its destination. [...] 
blocking occurs at some stage in the network, some message has been [...] 
in the network. Thus, in order for a connection to be blocked, [...] must have 
progressed [...]. Following this reasoning, as long as complete connectivity [...] 
network, some connection must be [...]; therefore, forward routing [...] 
is always being made. If all connections are treated equally within the [...] 
chance of being routed through the network. Thus, we can expect that [...] 
will eventually complete as long as a path exists to the desired endpoint. 

However, if the network has lost full connectivity due to newly arising [...] 
may no longer be reachable. Additionally, if there is a large amount of contention [...] 
destinations, the number of attempts necessary to deliver a message [...] 
pragmatic matter, we often limit the number of retries allowed. If a connection [...] 
in a fixed number of trials, the MRP-ENDPOINT [...] fails and [...] this information back 
to the node. At this point, the node may wish to communicate with [...] 
the source and nature of its problem. The [...] may also wish to [...] diagnostic [...] 
verify if nodes have actually been newly disconnected from the network. [...] 
from contention, then the node can take this opportunity to inform [...] 
management protocols of excessive contention. 
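
Source-responsible retry with a bounded number of trials can be sketched as a simple loop. This 
is a minimal illustration with hypothetical names; `attempt` stands for whatever opens the 
connection, sends the data, and checks the acknowledgment, and no backoff policy is modeled since 
the text leaves the choice of policy open. 

```python
def send_with_retry(attempt, max_trials: int):
    """Call attempt() until it succeeds or max_trials is reached.

    Returns the 1-based trial number that succeeded, or None so the caller
    can report the failure to the node for diagnosis.
    """
    for trial in range(1, max_trials + 1):
        if attempt():
            return trial
    return None

# Example: a connection that is blocked twice and then succeeds.
outcomes = iter([False, False, True])
print(send_with_retry(lambda: next(outcomes), max_trials=5))
```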

4.7.4 Receiving Data from Network 

To minimize end-to-end network latency, a system may begin processing [...] 
stream in parallel with the reception of the remainder of the data [...] 
receiving data from the network, however, has no guarantee of the integrity [...] 
receiving until it sees a checksum [...]. [...] begin processing the data 
as it arrives only as long as it can guarantee that the processing it does [...] 
will have no adverse effects if the data is corrupted. 

For example, consider a network operation which is intended to
written into the node's memory. If the only checksum was at the end o
the destination node could not begin writing the data into memory as it arrives, because
the address could be corrupted. In such a case, a corrupt address
to write data over some arbitrary place in the node's memory. Simil
guarantee about the length of the data, a network fault could cause the
to send what appears as more data than the original, uncorrupted m
would cause data in memory following the intended destination blo
these problems, the message data could start with the destination
are guarded with their own checksum preceding the actual data tra
the address and length are correctly received, the data may be store
corruption occurs in the data itself, the source will retransmit the d
which the memory is being maintained, it may be necessary for the n
memory being overwritten until the final verification of the integrity of
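
A guarded header of the kind described above might look like the following sketch. The 32-bit address and length fields, the use of CRC-32 as the checksum, and the names `build_message` and `receive_write` are assumptions for illustration; the report does not specify a word layout or checksum function here.

```python
import struct
import zlib

HEADER = struct.Struct(">II")  # assumed layout: 32-bit address, 32-bit length
CRC = struct.Struct(">I")      # CRC-32 stands in for the unspecified checksum


def build_message(address, data):
    """Prefix the data with a header that carries its own checksum."""
    header = HEADER.pack(address, len(data))
    return (header + CRC.pack(zlib.crc32(header))
            + data + CRC.pack(zlib.crc32(data)))


def receive_write(message, memory):
    """Apply a remote write to `memory` (a bytearray).

    Returns True when the write is accepted, False when the source
    should retransmit.
    """
    body_start = HEADER.size + CRC.size
    if len(message) < body_start:
        return False
    header = message[:HEADER.size]
    (header_crc,) = CRC.unpack_from(message, HEADER.size)
    if zlib.crc32(header) != header_crc:
        return False  # address or length is suspect: touch nothing
    address, length = HEADER.unpack(header)
    if len(message) != body_start + length + CRC.size:
        return False  # stream does not match the verified length
    if address + length > len(memory):
        return False
    data = message[body_start:body_start + length]
    # The header is verified, so storing may begin before the data checksum.
    memory[address:address + length] = data
    (data_crc,) = CRC.unpack_from(message, body_start + length)
    return zlib.crc32(data) == data_crc
```

A corrupted header leaves memory untouched, while corrupted data can only land inside the intended block, where a retransmission will overwrite it.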

A node may receive a bad data stream for either of the following reasons:

1. Checksum(s) indicate message may be corrupted

2. Data stream does not adhere to protocol expectations

When this happens, the receiving node is expected to indicate its rejection of t
If the receiving node can give some indication of why the data stream was re
may be able to use that information when failure diagnosis is neces
piece of information the source node requires is the fact the connect

4.7.5 Idempotence

In the introduction to this section, MRP-ENDPOINT was said to guarantee the uncorrupted
delivery of at least one copy of the message to the destination. We migh




the delivery of exactly one copy of the message. However, source-respo
the potential for dynamic fault occurrences allows for multiple deli

Consider what happens when the source receives a corrupted che
or other indications that something is wrong. It is safe to assume som
not be clear where the problem occurred. Particularly, if the fault onl
the destination, the source can believe that an operation failed wh
successfully. When the source thinks the operation has been corru
nature of the protocol has the source retry the operation since ther
operation has not been accepted by the destination. However, when
in which only the return data or acknowledgment was corrupted, th
data stream a second time.

The consequence is that all operations over MRP must be idempotent. That is,
performing an operation multiple times must not cause different results from performin
once. Our previous example (Section 4.7.4) of a cross network write op
since writing the same data twice will not change the data which i
ever, a network operation which caused a remote counter to be inc
idempotent since incrementing the counter more than once would
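
The contrast between the write and the counter increment can be checked with a small sketch. The dictionary standing in for remote memory and the helper names (`deliver`, `remote_write`, `remote_increment`, `is_idempotent`) are hypothetical and exist only for this illustration.

```python
def deliver(operation, state, copies):
    """Apply the same operation `copies` times, as a retransmission might."""
    for _ in range(copies):
        operation(state)
    return state


def remote_write(address, value):
    def op(state):
        state[address] = value  # rewriting the same value changes nothing
    return op


def remote_increment(address):
    def op(state):
        state[address] = state.get(address, 0) + 1  # every duplicate counts
    return op


def is_idempotent(make_op, initial):
    """Compare the result of one delivery with the result of two."""
    once = deliver(make_op(), dict(initial), 1)
    twice = deliver(make_op(), dict(initial), 2)
    return once == twice
```

Here `is_idempotent(lambda: remote_write("x", 7), {"x": 0})` holds, while the same check on `remote_increment("x")` fails, which is the case a duplicate delivery would corrupt.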

There are a few choices for dealing with the idempotence requirem
all operations over MRP directly to be idempotent or we can implement a
between MRP and the application which guarantees idempotent message delivery.

The Transmission Control Protocol (TCP), in use on many local-are
"reliable" data streams by using sequence numbers [Pos81] to guaran
delivery. When a source needs to communicate with a destination, i
destination for a valid set of sequence numbers. Each data packet is
transmitted to the destination with a different sequence number. Th
of all the sequence numbers it has seen so that exactly one copy of e
destination is passed along to higher-level protocols. In this manner
source-responsible retransmission can be used and each message is effectively idempote
the protocol level above TCP.

While one could implement a TCP-style unique sequence number pro
a solution is inefficient for high-speed communications in a large-scal
overhead in terms of the space and processing time required to track
messages based on sequence numbers could easily become many ti
space required for basic message transmission. Alternatively, a se
designing the lowest level communications primitives to be idempo
avoiding this cost.

4.8 Composite Behavior and Examples 

Having detailed MRP in the previous sections, this section reviews t
behavior and shows several representative examples of protocol ope





Figure 4.4: 16 × 16 Multibutterfly Network

4.8.1 Composite Protocol Review 

The source endpoint feeds messages into a router in the first stage o
word is treated as a ROUTE word. At each router, it is either pipelined through the net
or blocked by existing connections. Between router stages, the bits a
routing bits to each router. When the bits are exhausted the subseque
to swallow the ROUTE word and promote the next DATA word to be the ROUTE word. When
the entire message is fed into the network, a TURN word is generally sent to reverse
the connection. The first router returns STATUS and CHECKSUM words and forwards
the data received from the second router in the network. The first re
router in the network will be STATUS and CHECKSUM words. The source will,
thus, receive successive STATUS-CHECKSUM word pairs from each router. If the
connection is blocked at some point, the blocked router will send a DROP following its STATUS-CHECKSUM
pair; the DROP will close down the connection as it propagates bac
the connection was not blocked, the source will receive data from
final STATUS-CHECKSUM pair. The connection may be reversed or closed by the
has completed its reply.
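
The reply stream that the source sees after a TURN can be modelled with a short sketch. The tuple encoding of words, the CRC-32 checksum, and the functions `reply_stream` and `interpret` are assumptions made for illustration; only the ordering of STATUS, CHECKSUM, DATA, and DROP words follows the review above.

```python
import zlib


def reply_stream(num_stages, payload, blocked_stage=None, reply=b"ack"):
    """Words the source receives after its TURN, in arrival order."""
    checksum = zlib.crc32(payload)  # stand-in for the unspecified checksum
    last = num_stages if blocked_stage is None else blocked_stage + 1
    words = []
    for stage in range(last):
        status = "blocked" if stage == blocked_stage else "ok"
        words.append(("STATUS", stage, status))
        words.append(("CHECKSUM", stage, checksum))
    if blocked_stage is None:
        words.append(("DATA", reply))   # reply from the destination endpoint
    else:
        words.append(("DROP",))         # closes the path back to the source
    return words


def interpret(words, payload):
    """Source-side reading of the reply stream."""
    expected = zlib.crc32(payload)
    for word in words:
        if word[0] == "CHECKSUM" and word[2] != expected:
            return ("corrupted", word[1])
        if word[0] == "STATUS" and word[2] == "blocked":
            return ("blocked", word[1])
        if word[0] == "DATA":
            return ("delivered", word[1])
    return ("dropped", None)
```

Because every router contributes its own pair, the source can tell at which stage a connection blocked or where a checksum first disagreed, which is the information later used for fault diagnosis.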

4.8.2 Examples 

Consider the network shown in Figure 4.4. The possible paths from
highlighted. For the sake of simplicity in examples, let us consider th
the paths between input 6 and output 16. The same protocol is obeye
examples easily generalize to the complete network.

Each of the following examples (Figures 4.5 through 4.10) shows severa
cations over one of the paths indicated in Figure 4.4. Each router is lab











with the control/data word transmitted during that cycle and the direction of d
flow. Often words of the same type are subscripted so the progression
tracked from cycle to cycle.

Opening a Connection 

Figures 4.5 and 4.6 show how a connection is opened through the ne
Figure 4.5 succeeds in opening a connection through the network, whereas the on
blocked at the router in the third stage.

Shown above is a cycle-by-cycle progression of control and data throug
a connection is successfully opened from one endpoint to another.
through the network advancing one routing stage on each clock cycle. T
message is the routing word.


Figure 4.5: Successful Route through Network


Dropping a Connection 

Figure 4.7 shows an open connection being dropped in the forwa
a connection from the reverse direction proceeds identically with
destination reversed. If the connection is blocked, the DROP propagates up to the router at w
the connection is blocked.





In the event that a connection cannot be established through some routing component
in the network, the message is discarded word-by-word at that routing component. The example
shown above depicts such blocking at a router in the third stage of the network.


Figure 4.6: Connection Blocked in Network


When the transmitting endpoint decides to terminate a message, it ends the message
with a DROP control word. The DROP follows the message through the network, resetting
each link in the connection to idle after traversing the link.


Figure 4.7: Dropping a Network Connection





Turning a Connection (Forward)

Figure 4.8 shows a successful connection being turned and backwa
the network. Figure 4.9 shows how a blocked connection is collapse


When the source wishes to know the state of its connection and get
destination, it feeds a TURN into the network following the end of its forward tran
data. As the TURN works its way through the network, the links it traverses ar
In the pipeline delay required for the link to begin receiving data in the
each routing component sends status and checksum information to inform th
connection status. The TURN is thus propagated all the way through the network an
routers along the connection have sent status and checksum words,
along the connection.


Figure 4.8: Reversing an Open Network Connection


Turning a Connection (Reverse)

Turns from the reverse direction proceed basically the same as forward turn
and CHECKSUM words, routers send DATA-IDLE words when turned from the reverse d

Figure 4.10 shows a turn from the reverse direction.





In the event that the connection was blocked at some router in the n
router will be unable to provide a reverse connection through the net
sending its own checksum and status words, that router will send a DROP back to the
source. As the DROP propagates back to the source, it resets the intervening net


Figure 4.9: Reversing a Blocked Network Connection

4.9 Architectural Enhancements 

Beyond the basic routing strategy, there are several protocol enhanc
performance under certain cond

4.9.1 Avoiding Known Faults 

As described so far, all router decisions are made purely by random
faults have occurred in the network and are known to exist, it would
viewpoint, to deterministically avoid them. In extra-stage style netwo
the faulty path(s) from our list of potential paths. When randomly sele
only made from among the set of paths believed to be non-faulty. In di







A connection flowing in the backward direction can be reversed again so that data
may flow, once again, from the original source to the original destinatio
similar to the original reversal except that DATA-IDLE words are inserted during the reversal
pipeline delay rather than checksum and status information.


Figure 4.10: Reverse Connection Turn


themselves are making the detailed path decisions. For dilated rout
router to avoid faulty links or routers.

Port deselection is one way to achieve this deterministic fault-avoidanc
components. That is, if we have a way to deselect each backward port, we ca
deterministically avoid ever traversing a known faulty link or attemp
router. The semantics of this deselection are such that the deselecte
it is always busy and hence removed from the set of potential backw
direction. When connections are routed through the router, the dese
never used.
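
Treating a deselected backward port as permanently busy can be sketched in a few lines. The function `choose_backward_port` and its set-based arguments are assumptions for illustration, not the router's actual selection logic.

```python
import random


def choose_backward_port(ports_in_direction, busy, deselected, rng):
    """Pick a backward port at random among the usable candidates.

    A deselected port is treated exactly like a busy one, so it can never
    be chosen. Returns None when the request is blocked.
    """
    candidates = [p for p in ports_in_direction
                  if p not in busy and p not in deselected]
    if not candidates:
        return None
    return rng.choice(candidates)
```

Because deselection only shrinks the candidate set, the random selection among the remaining ports is unchanged, and a router with every port in a direction deselected simply reports blocking.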

For the same reasons, it is useful to be able to deselect forward po
to a faulty router or faulty interconnection link may see spurious dat
interfering with the normal operation of the rest of the routing compo
port causes the forward port to ignore any connection requests it rec
A faulty link between routers is thus excised from the network by d
backward port pair attached to the link. A faulty routing componen
from the network by deselecting all the backward and forward ports
Chapter 5 addresses the issue of identifying fault sources. Chapter 5 a






Shown above is a connection blocking scenario where the successf
shown in bold and the blocked connections are shown in thick gray. Co
successfully made between nodes 16 and 15 and between nodes 10 and
connection between nodes 7 and 15 was blocked in the third stage si
no free output ports in the intended direction. Similarly, a connection
nodes 1 and 15 failed due to blocking in the fourth stage. Each of these blo
consumes routing resources up to the stage in which blocking occurs
non-blocked connections consume resources. New connections whi
these blocked connections continue to utilize network links can, in t
failed connections.


Figure 4.11: Blocked Paths in a Multibutterfly Network


for reconfiguration and considers network reconfiguration in more d

4.9.2 Back Drop 

As described so far, when a connection becomes blocked at some
path from the source remains open. The router links which the connec
up to that stage remain allocated to the blocked connection until the DROP at the tail of the
message works its way up to the blocked router. If the network is ver
in later network stages will hold resources in many routing stages an
Figure 4.11). Further, the length of time links are held open will depend on
of the connection data. Longer data transmissions will exacerbate
connection on the rest of the network. Since the blocked connectio
stage they may in turn block connections at earlier stages.

To minimize the detrimental effects of a blocked connection on su
can be given the ability to shut down a connection from the head of the data stre
requires some way to propagate the information that the connection













With fast path collapsing, a blocked connection is collapsed in a pipe
the point in the network where blocking occurs. The example shown
a connection encounters blocking and is collapsed using a back dro
routers in the connection closest to the blocked router are freed earl
to the source end of the connection.


Figure 4.12: Example of Fast Path Reclamation


open connection to the source. One simple way to achieve this is to a
between each pair of backward and forward ports inside the netw
input and each network output and their associated backward and forw
becomes blocked, the router which notes the block uses this ba
drop line, to inform the upstream router that the connection is blocked
nowhere. The upstream router may then deallocate the backward po
it for reuse, and pass along the blocking information to its own upstre
manner, the connection may be collapsed in a pipelined fashion sta
and propagating back to the source. If the source endpoint is signalled
line before it has finished sending the message, it too, can abort the
source may then begin to retry the connection. Figure 4.12 depicts how
collapsed using the fast path reclamation.
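
A rough feel for the benefit comes from a simplified timing model, sketched below. It assumes, purely for illustration, that the head of a message advances one stage per cycle, that the back drop signal also travels one stage per cycle, and that a router holds its backward port from the cycle the head arrives until either the tail DROP or the back drop passes. The function `hold_cycles` and these timing assumptions are not taken from the report.

```python
def hold_cycles(stage, blocked_stage, message_words, back_drop):
    """Cycles the router at `stage` keeps a backward port allocated to a
    connection that blocks at `blocked_stage` (with stage < blocked_stage).

    message_words counts every word in the stream, including the final DROP.
    """
    if stage >= blocked_stage:
        raise ValueError("only routers upstream of the block hold the path")
    tail_release = message_words  # the tail DROP passes after the whole message
    if not back_drop:
        return tail_release
    # The head needs (blocked_stage - stage) cycles to reach the block and
    # the back drop needs the same number of cycles to return.
    collapse_release = 2 * (blocked_stage - stage)
    return min(tail_release, collapse_release)
```

Under this model a 20-word message blocked in stage 3 ties up the stage-2 router for 20 cycles without a back drop and for 2 cycles with one, and routers nearest the block are released first.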

Fast path collapsing has a number of positive effects on performa






resources which are not being used to transmit successful data. Sinc
at the head of the message where blocking occurs, the routing resou
are freed more quickly than those in earlier network stages. This is f
which block in later network stages tie up more resources in the ne
detrimental effect on the network than those which block in earli
Figure 4.12 we see blocking occurring in the third stage. The router in t
from transporting data in a few cycles. As a result, the total time th
is occupied with the blocked connection is small. The router in th
backward port carrying the blocked connection for only a couple of

Fast path collapsing also benefits fault tolerance. Without some for
mation, only the sending endpoint has the opportunity to shut down
endpoint must wait for the connection to be turned or dropped bef
output or network input carrying the data stream. If a fault occurs dur
a router to continually send data, there is no way to shut down the connection.

The routing resources consumed from faulty router up to and inclu
network input or output remain unusable as long as the fault persists. Fast path
collapsing addresses this problem. If a connection continues to pro
the network endpoint expects the connection to turn or complete,
conform to the expected protocol, the endpoint may use the back drop line to i
collapse of the path from the downstream end of the connection. O
faulty router, which is continually sending data, will not affect the net
router may continue to send data to its immediate neighbor. Howeve
the associated backward port and will not forward the data anywhe
a network endpoint can shut down a faulty connection stuck in the

The disadvantage of fast path collapsing is that the source no longer
information back from every router in a blocked path. The source will
information back from every router for connections which are not bl
when connections are blocked, the fast collapse does not allow the
this detailed connection status. Since this information is of intere
basic functionality, fast path collapsing is supported as a configuration optio
be disabled and enabled by the testing system. Fast path collapsing w
operation and disabled as necessary for fault diagnosis.

4.10 Performance 

The performance data presented in the previous chapter (Figures 3.21 and 3.19) were
gathered on dilated networks using MRP with fast path collapsing and fault masking.
MRP allows the performance to degrade gracefully with the network a
network.




A router (or interconnect) may develop a fault such that it appears to alw
The example shown above depicts how backward path reclamation a
routers in the path to be reclaimed at the request of the receiving e


Figure 4.13: Backward Reclamation of Connections Stuck Open

4.11 Pragmatic Variants 

There are a number of variants on the basic protocol that arise fro
erations. One primary consideration is the difference between the la
an operation and the frequency with which we can begin new ope
look at several points in the protocol where pipelining the transmiss
connection bandwidth since the latency involved may be greater tha
can be accepted.

4.11.1 Pipelining Data Through Routers 

If we can clock data between routing components faster than we
routing component, we may be able to achieve higher bandwidth b
multiple clock cycles to traverse the routing component. This particularly makes sense w
data can be clocked at a multiple of the switching latency. Pipelining the
data through the routing switches has little impact on the MRP-ROUTER protocol. The one place
where it does show up is when connections are reversed. Instead o
delay before return data is available (See Figure 4.8), there is a one cycle delay for every
additional pipeline stage through the routing switch. These addition
fact that it is necessary to flush the router's pipeline in the forward dir









































































When pipelining the data transfer through routing components, there
delay cycles when the direction of data transmission is reversed. The
to the need to flush the router pipeline in the forward direction and r
direction. Shown above is a connection turn when a single additional
added to each routing component.


Figure 4.15: Example Turn with Pipelined Routers





4.11.2 Pipelined Connection Setup 

The longest latency operation inside a routing component is often
when arbitration must occur for a backward port. If we require that
the same amount of time, or the same number of pipeline cycles, as
through the component, the latency associated with connection setu
the latency associated with data transfer through the router. Alterna
generally more pipeline cycles, for connection setup than for data tr
this, each router will consume a number of words equal to the differe
latency and the transmission pipeline latency from the head of each

Consider a router which can route data along an established path
three cycles to establish a new connection. The router would consu
head of each routing stream before the connection is established and it forwards the rema
of the data out the allocated backward port. Figure 4.16 shows the da
establishment in this scenario.

When pipelined connection setup is used, there is no need for an
on each router since each router always consumes one or more wo
stream it routes. The only other accommodation needed for this kin
construct the header with the appropriate padding between routin
the sees the proper routing specification.
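
The header construction can be sketched as follows, assuming each router consumes exactly the difference between its setup latency and its transfer latency, counted in words, and that the first consumed word is that router's routing word. The `"PAD"` filler word and the functions `build_header` and `words_seen_by_stage` are hypothetical names chosen for this example.

```python
def build_header(route_directions, setup_cycles, transfer_cycles, pad_word="PAD"):
    """Interleave routing words with padding so each router sees its own
    routing word at the head of the stream it receives."""
    consumed = setup_cycles - transfer_cycles  # words swallowed per router
    if consumed < 1:
        raise ValueError("setup must take longer than transfer in this model")
    header = []
    for direction in route_directions:
        header.append(("ROUTE", direction))
        header.extend([pad_word] * (consumed - 1))
    return header


def words_seen_by_stage(message, stage, consumed):
    """Stream arriving at `stage` after earlier routers consumed their share."""
    return message[stage * consumed:]
```

For the three-cycle setup and one-cycle transfer example above, each router swallows its routing word and one padding word, so a three-stage header is six words long.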

4.11.3 Pipelining Bits on Wires 

In Section 3.2, we noted that in many networks, the transit ti
the wires between routing switches often exceeds the rate at which n
section, we suggested that we could pipeline multiple bits on the w
from being limited by the length of long wires. In many ways, this tec
pipelining data through a routing component (Section 4.11.1). With wire pipelining, w
can support a data rate which is higher than the transmission latenc
or across the wire. The effects of wire pipelining on the routing protoc
effects of pipelining on the routing protocol. Like the router pipeline,
each wire pipeline must be flushed and then refilled in the reverse direc
backward data is returned. Again, DATA-IDLE words must be inserted into th
return stream to hold the connection open while the wire pipeline
shows a turn scenario when bits are being pipelined on the wires be

One important assumption is that the wire between two compo
number of pipeline registers. How one ensures that this assumptio
an important implementation detail. With a properly series termina
between routers, we do not have to worry about reflections on the wire. The
wire will look, for the most part, like a simple time-delay. The task is to make the time-d
approximate an integral number of clock cycles so that it does beha
registers. A brute-force method is to carefully control the length and
the wires between components. A more sophisticated method is to
drivers themselves to adjust the chip-to-chip delay sufficiently to me





In situations where data transfer can occur with significantly less laten
setup, it may make sense to pipeline the opening of the connection. T
shows the case where a connection cannot be established to forward
port to the backward port until two cycles after the initial route reques
the routing word and the word immediately following it are not forwa
router. Once the route is setup, data flows through the routers in the
fashion with no additional delay.


Figure 4.16: Example of Pipelined Connection Setup
















































































Depicted above is a turn occurring in a 3-stage network where there is
long wire between stages two and three. Since it takes two clock cycles
wire in each direction, during a turn the second stage router must fill
with DATA-IDLE words while waiting for return data from the third stage rout
end of the long wire.


Figure 4.17: Example Turn with Wire Pipelining


Chapter 6 looks at techniques which support this assumption in further detail.

4.12 Width Cascading 

In Section 2.7.1 we noted that the number of i/o pins on an IC is limited
We also noted that the number of available i/o pins is growing slowl
Section 3.6 we pointed out that this limitation requires a trade-off bet
channels in the network, the radix, r, and the dilation, d, of the routing
component. To first order, the number of i/o pins required for a square
d, and data channel width, w, is 2 · r · w · d. Since the total number of IC pins at a g
technology and cost is limited to $k_{pins}$, we are constrained as:

$$k_{pins} = 2 \cdot r \cdot w \cdot d \qquad (4.1)$$

Alternatively, we can consider how to break the function of a single
into multiple ICs. If we can split the function of a router across severa





being able to build larger routing components or being able to use smaller a
cheaper ICs in building our routing components. Of course, this segm
benefit if we can do it without incurring the cost of a large number of
constituent ICs.

Width cascading is a technique for building routing components with wide
components with narrow data channels. This allows us to build a l
a large width, without directly sacrificing radix or dilation. That is, we
can cascade c routing components to achieve a logical width, $w_{cascade}$, determined by the followin
equations:


$$k_{pins} = 2 \cdot r \cdot d \cdot w_{router} \qquad (4.2)$$

$$w_{cascade} = c \cdot w_{router} \qquad (4.3)$$

Of course, equation 4.1 was just an approximation of the pin requireme
only to the extent that the overhead, in terms of pins interconnectin
compared to the total number of pins on the router.
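
As a worked instance of the pin budget, the sketch below solves the first-order relation of equation 4.1 for the channel width and then scales it by the number of cascaded components. It assumes that the cascade relation is a simple multiple, w_cascade = c · w_router, and it ignores the overhead pins mentioned above; the function names and the example numbers are illustrative only.

```python
def max_channel_width(k_pins, radix, dilation):
    """Largest integer w satisfying 2 * r * d * w <= k_pins."""
    return k_pins // (2 * radix * dilation)


def cascaded_width(k_pins, radix, dilation, cascade):
    """Logical channel width when `cascade` components are ganged together."""
    return cascade * max_channel_width(k_pins, radix, dilation)
```

For example, a 256-pin budget with radix 4 and dilation 2 allows a 16-bit channel on one component, and cascading four such components would give a 64-bit logical channel under these assumptions.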

4.12.1 Width Cascading Problem 

Width cascading exploits the basic idea that we can replicate the n
sponding data channels to achieve a wider data path. This replicati
only have the desired effect of behaving like a wider data network if t
two networks is identical. That is, the data launched at the same tim
arrive at the destination simultaneously or block at the same stage in
complicated by the following aspects:

1. Faults may occur affecting one copy of the network differently from

2. The techniques in use for selection among available, equivalent out

On a more basic level, then, the problem becomes that of ensuring
comprising a single, logical router handle data identically.

Our basic problem is to keep cascaded routers in mutually consistent states. That i
set of primitive routing elements which act as a single logical routing component
their crossbars such that the same forward ports are transferring dat
Each crossbar router must, therefore, allocate connection requests i
following conditions hold, the routers will be in mutually consistent stat

1. each router sees identical connection requests

2. each router handles connection requests in the same way

3. each router turns or drops each connection at the same time

The routers may see different connection requests if the routing spe
to some fault. Once the routers in the cascade see different connecti
channel, the routers may handle the route differently. If any of the ro




assignment of a backward port, the subsequent router stage will not
that is, the logical data channel will be split and parts of the data stre
routers causing these routers, in turn, not to see identical connectio

Non-dilated routers in the absence of faults will route identical co
directions, as long as the routers were already in a consistent state
connection requests. However, a fault inside the router IC may cause a
and misroute data or open or close connections in a manner incor
stream. Also, with faults the cascaded routers may see different conne
if the routers utilize their freedom to choose different channels in the sa
direction, the subsequent stage will see split traffic.

With faults, it is possible that one portion of the data stream app
connection when another does not. This can lead to some of the ro
and reclaiming backward ports while others do not. The routers wi
since they do not all have the same set of backward ports available to
requests

4.12.2 Techniques 

We can address the difficulties associated with width cascading by u

1. Identical control fields

2. Shared randomness

3. Wired-AND in-use indication on backward ports

4. End-to-end checksums across logical data channels

The first thing we need to ensure is that each router in a cascade wi
control fields in the absence of faults. This is easy to guarantee by simply constructing the ROUTE,
TURN, DROP, and DATA-IDLE values provided to the wide, logical connection in
from copies of the corresponding control words for the primitive routers. We must al
construct the wide data such that each router sees the same ROUTE values in its contr
In the absence of faults and dilation, this will ensure that each router sees the same route requests
and the same turn and drop indications.

We want a set of cascaded routers to make the same random decisi
connection request.To satis fy this re qu ire nr e n t , w ffipihnidatang or nr o 
component for "random" data. This externalizes the randomness so th
routers in the cascade. We can then make sure that each router mak
of connections to available backward ports based on the connectio
va lues. We call this technique ofusingsha r xhdr^ckmhdmmmsi fla> n d o nr i za t i o 
avoid the need for extra components to provide random bit stream
we can a 11 ow eacbpoaiietnifi goc germ erate one pseudo -random bit stre a 
bit streams can then be wired to random inputs as appropriate wh
network.
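
The shared-randomness idea can be illustrated with a small sketch. It assumes a hypothetical `RouterSlice` model and uses a seeded software generator as a stand-in for the external random-bit source; the names, the 8-bit random value, and the modulo selection rule are illustrative assumptions, not details from this report. The point it shows is only that slices which start in the same state and receive the same requests and the same random value make the same allocation.

```python
import random

def shared_random_bits(seed, width=8):
    """Stand-in for the external random-bit source wired to every router."""
    rng = random.Random(seed)
    while True:
        yield rng.getrandbits(width)

class RouterSlice:
    """One router in a width cascade (hypothetical model)."""
    def __init__(self, num_backward_ports):
        self.in_use = [False] * num_backward_ports

    def allocate(self, candidate_ports, random_value):
        """Pick a free candidate port using the externally supplied value."""
        free = [p for p in candidate_ports if not self.in_use[p]]
        if not free:
            return None  # request is blocked
        choice = free[random_value % len(free)]
        self.in_use[choice] = True
        return choice

def route_cascade(slices, requests, random_source):
    """Apply the same requests and the same random value to every slice."""
    history = [[] for _ in slices]
    for candidates in requests:
        value = next(random_source)  # one shared value per request
        for i, s in enumerate(slices):
            history[i].append(s.allocate(candidates, value))
    return history

slices = [RouterSlice(4) for _ in range(4)]
history = route_cascade(slices, [[0, 1], [0, 1], [2, 3]],
                        shared_random_bits(seed=1))
```

If each slice instead drew from its own private generator, the per-slice histories would generally diverge on the first request with more than one free candidate.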





Figure 4.18: Cascaded Router Configuration using Four Routing Elements

The critical piece of state information which needs to be mainta
among routers is which backward ports are being used by which fo
directly comparing this state among routers on every clock cycle wo
(i ■ o) b i t s am ongrouters each clock cysr hen. eNottiionngst ha a t mr ca sntycw o r d s 1 o 
and hence connection requests tend to occur infrequently, we can ap
simply comparing which output ports are actually in use on each cyc
imperfect in determining all possible faulty connections, but it will c
immediately. We can arrange for the router to recover from almost all

To make the comparison, we add a single extra pin per backward
the router to indicate when the backward port is being used. We t
correspondingbackw ard -p or tin -use pin sXMrfcc ca is figqdr a di o cn u Whres rin a w 
a router begins to use a connection, it signals that the port is in use
data and monitors its in-use pin. The in-use pin will only be asserte
cascade agree that the port should be alloctatl endg tlfma ft etrh aeni m pi ps eo p r 
signal appears unasserted when a router has tried to allocate a back
it is in an inconsistent state from its peers, deallocates the backward
In this manner, the cascade will only differ in terms of backward p
transient amount of time when a fault occurs. In most cases, if the fa
be m i s r o u t e d , d i ffe r e n t backw ard port S^MDrwe isld Ideectteecdt Ain ad fhhuilst w ni de d 
clear its effects.Figu re 4.18 s h o w shp cwn fonut b iuos u it igih^jEi ® madi b eh da -r e d 
randomness scheme might be connected to form a cascaded router
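
As a rough illustration of the wired-AND check described above, the sketch below models one in-use line per backward port. The function names and the set-based representation are hypothetical; the only behavior taken from the text is that a line reads asserted only when every router in the cascade asserts it, and that a router which sees its allocated port unasserted deallocates that port.

```python
def wired_and(signals):
    """Model of a wired-AND line: high only if every driver asserts it."""
    return all(signals)

def resolve_in_use(allocations, num_ports):
    """Return each router's allocations after the in-use comparison.

    allocations: one set per router, holding the backward ports that
    router believes it has allocated this cycle.
    """
    # One shared in-use line per backward port.
    lines = [wired_and(port in alloc for alloc in allocations)
             for port in range(num_ports)]
    # A router keeps a port only if its in-use line reads asserted.
    return [{port for port in alloc if lines[port]} for alloc in allocations]

# Router 2 disagrees with its peers about the second connection.
before = [{0, 2}, {0, 2}, {0, 3}, {0, 2}]
after = resolve_in_use(before, num_ports=4)
```

In this example every router drops the disputed connection, so the cascade returns to a consistent state with only port 0 allocated.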

The onlytim e a fault w ill fool this a p p r o x® m ia 4 icot m o ire s wa li © rb eni m $ t i p 
opened simultaneously through the router. In this case it is possible
will be assigned during that same cycle. If the fault causes a set of m
connected to the same set of backward ports, but in a different con
w i r &Mto-w illbe m istakenlyindicate thatthe connections are valid.Asi 
faults and connection requests occur infrequently, we do have some
occurs in some stage other than the final stage of the network, it is gen





connections will be going to different destinations. As a consequence
diverge in a subsequentnetw ork stage. Wh e rkNDhwe yld o hdui kadrcgav, trh feh w ire 
connections. If w e are usingthe fastpath collapsing^ps cdr i h e d in Sec 
connection can be collapsed quickly from the point of divergence. Co
quickly recover from this inconsistent state.

However, if the fault occurs late in the network or the spliced conn
to the same destination, it is possible that a network output will se
the endpoint we can make use of the forward checksum to determi
at the endpoint is invalid. For this reason, the forward checksum in
be computed across the entire width of the logical data channel rat
streams.
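
A small sketch can show why the checksum should span the full logical width. It uses `zlib.crc32` purely as a stand-in checksum (the report does not prescribe one here), and the byte-wide slices and helper names are illustrative assumptions. Two slices taken from different messages each look valid on their own, while a checksum over the interleaved wide words exposes the splice.

```python
import zlib

def slice_checksums(slices):
    """Checksum each narrow data stream independently."""
    return [zlib.crc32(bytes(s)) for s in slices]

def wide_checksum(slices):
    """Checksum the logical wide words, interleaving the slices."""
    wide = bytes(b for word in zip(*slices) for b in word)
    return zlib.crc32(wide)

# Two two-slice messages, one byte per slice per cycle.
msg_a = [[1, 2, 3], [4, 5, 6]]
msg_b = [[7, 8, 9], [10, 11, 12]]

# A fault splices slice 0 of message A with slice 1 of message B.
spliced = [msg_a[0], msg_b[1]]

per_slice_ok = slice_checksums(spliced) == [slice_checksums(msg_a)[0],
                                            slice_checksums(msg_b)[1]]
wide_ok = wide_checksum(spliced) == wide_checksum(msg_a)
```

Here `per_slice_ok` is true, since each stream is internally consistent, while `wide_ok` is false, which is the case the endpoint needs to catch.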

4.12.3 Costs and Implementation Issues 

The first order cost for supporting this width cascading is:

1. o additional i/o pins on the router (one for each backward port)

2. Several random input and, perhaps, one random output pins

Ext ernally, w e w illhave to place a pul kMDp bnns e aam dh \s fet hve i M hrae vl-to rout 
the shared random bits and the in-use ctctlni firg)t 1 M ie eosn Mi ® ridi e-u - T® lwt h £esdf 
AND to be s u ffic ientlysm all,the prim itive routingelem ents in a cascade 
each other. Si nee the state s ji a mi® ig bseitswdeoenne cw)Ahtlhs tt h he cvf id r e dt-h e 
num her ofcom ponents supported bya cascaldt^ristlHmritthdnby lihy e p h 
architect u ral features. On e reasonable option fo r 1 a r ge , h i gh -p e r fo r m 
cascade togetheras bare d bed ai if ea (fie ai Sli^kt Hhi p mint ii-c h iopdm 1 e w ill 
allow the shared inter connect to be veryshort. Of course,ifone is w i 
to the cascade and dediaraafih imopu ut t p fit ih id an ’a a n gu s e signals,itis possib 
to a vo id the electrical issues associ aAMtd w ith the externalw ire d - 

4.12.4 Flexibility Benefits 

Width cascading allows us to improve the bandwidth without incr
com ponent.Width cascadingalso im pUtiQnY1ni& to ryaanl kcmwi s is got mil ® tsesm (gey, 
to be transferred to and from the netw ork in few er clock cycles. Ad d i 
m ake narrow channelprpmni biivet a - a hi t>i w gicnogm sto take advantage ofthe 
resources for increasing the radix or dilation. Increasing the radix wil
T s , w hile increasingthe dilation w ill d e c re a s e contention and increa 

4.13 Protocol Features 

In Section 4.1 we identified five key features we hoped to design into a
In this section,w e brie flyMffiTRbeRw u It bn\^ PlrhD etocoladdresses these issues 




4.13.1 Overhead 


MRP keeps overhead low by being minimal. The only overhead data
data through the network are:

1. Routing words which determine the destination

2. Forward checksum(s) used to insure data integrity

The forward checksums are untouched by the routers in the netw
endpoints to use checksums of the appropriate length to guarantee
corrupt m essage acceptance forthe applicatiorti tThtas isai iffitd negpwt ords i 
bits to fully specify the desired destination.

Routing words directly specify a desired output direction so that mi
to handle each routing request. Each routing decision is simple, all
occur with low overhead. End-to-end acknowledgments allow route
connections. 

4.13.2 Flexibility 

By being m inirn aland m rp cc <a m hrei fets slid yadapted to a w ide range of 
level protocol requirements efficiently. The basic protocol is lightwei
detailed packet sem antics. Th is allow s applicatiopp<Q>irt<ninst^m net 
to impose the packet structure which best suits them. Transmission
router protocol and a single connection may be reversed any numbe
efficient data transfers.

4.13.3 Distributed Routing 

MRP has the desired distributed routing property. No global knowle
through the network. The incorporated randomization even obviate
knowledge of faults. At the same time, local reconfiguration can be u
fa u 11 s w ithoutrequirirtigt^ hoy *ii m glne tahn a notion ofthe global state oftl 
Dilated routers allow the network to make more efficient detailed ro
information.

4.13.4 Fault Tolerance 

End-to-end checksums allow corrupted connections to be detect
coupled w ith end-to-end acknow le deggnc b ndt a tm a kras sa me i fc asi n ct<h east s fu 
transrn itted to the desired destination at lemdto ann icza t i Thne i q opna thhi n a i 
selection alongw ith m ultipath netw orks allow s the protocol to eve 
w hich exists betw een a source and destinat bonnnpe ac itri.oTh s fa b hi i byd d hs h i 
ends of a connection allows faulty connections to be collapsed by ei
features combine to guarantee that:


1. Any data corruption is detected


2. An yfailed or corrupted connection w id 1c te e p 4 4 idi dbcfy <& m st i i rp aota dtm ve 1; 

3. As long as the network continues to completely connect all comm
can determine a fault-free path which allows any connection to

These features are guaranteed regardless of whether the fault is dyna
even when faults occur during an ongoing transmission. Further, sim
can be integrated into the protocol to make it possible to minimize
identified, static faults on network performance.

4.13.5 Fault Identification and Localization 

A combination of forward and backward checksums along with c
pro vifdRe w ith the abilityto locate and detect fa u Its. Fo r w ard checks un 
through the network to make sure that no fault which affects data t
Backward checksums provide an estimate of where faults may hav
checksums provide an at-speed indication of the data integrity betw
inside the netw ork. Status indications help localieii iagrt ht^ro u tin gc 
actual path taken by any route in the network. Status information a
pipeline reversal slots which have no data to transmit otherwise.
data adds no overhead to the routing protocol. Additionally, since e
actually verify the integrity of the received data, this backward status
normal operation.

4.14 Summary 

In this chapter, we have described a source-responsible, pipelin
protocol suitable foruse w ith m ultipa fclho nme i saa t>i o kn s, c Weu spal w d t tv ai t hr ae n 
to-end message checksums and source-responsible retry led to a pro
occurring faults. We also saw that such a protocol could be realized w
decentralized control. With some simple optimization to the basic
performance can be achieved even when faults exist in the network
protocol was easily adapted to accommodate several pragmatic op
relative performance we can extract from our implementation techn

4.15 Areas to Explore 

This chapter has demonstrated the existence and details of a proto
performance and dynamic fault-tolerance with low overhead. There
further exploration relative to optimizing such a protocol. Additiona
suggested in this chapter which could be further quantified.

1. In Section 4.7.3, we noted that there are several options for when an
data. There is much room for understanding the best strategy for re
networks.





2. The routingschem e descri b oblMdus, th it h <a h n p tienrfdi sr m a t i o n a b o u t d o 1 
stream network traffic is used when selecting among available pat
selected could be improved if routers had some information abo
succeedingin ocnintdn giao cn through aparticularbackw ard port.Fin 
simple and cheap way to get timely congestion information bac
remains an interesting problem. Further, utilizing that informati
destroy the fault-tolerant properties of the protocol or adversely a
routing decisions can be made also remains a difficult problem.

3. In Chapter 3 we presented aggregate data for the performance of t
multipath networks. It would be worthwhile to quantify the relat
each of the key features. Particularly, it would be desirable to quan

(a) The benefit gained by using dilated routers in a network rather t
sized non-dilated routers in an extra-stage network with comp

(b) The benefit offered by fast path collapsing for various networks

(c) The benefit offered by masking faults as opposed to leaving them

(d) The impact of various pipelining strategies on network perfor

(e) How well this heuristic routing strategy stacks up against a c
does have global knowledge of all connection attempts in the

4. The strategy presented here is based entirely on circuit switching,
to-end benefits and timely retries for low-latency data transmissio
nature of the protocol. However, when the network becomes lar
of pipeline stages required to get from source to destination, com
typical data-stream, it will become less efficient. In such scenario
longer to hold the connection for the reply than to actually transm
that in such a situation packet sw itch hmgpw; t> m lad r« ta ffieei et h fel ynbty r ( 
allowing the router resources to be used by other connections w
pipeline to flush and refill in the reverse direction. It would thus be
switching scheme which offers comparable fault tolerance and
benefits as this protocol while allowing more efficient use of rout




5. Testing and Reconfiguration 1 


In this chapter we consider how to deal with failures in the netwo
we noted that the network and routing protocol allow us to continu
any knowledge about why connections fail. However, we also noted
benefit associated with masking the effects of faults. Further, it is use
the network. In this chapter, we develop reliable mechanisms for i
and discuss their utility.

5.1 Dealing with Faults 

The primary question to address is: "What do we do about faults in th
the previous chapter, we could do nothing. As long as the network con
all communicating processors via some path, correct operation wil
do nothing, there are a number of important things which we do not

1. How many faults have occurred in the network.

2. If the complete connectivity requirement has been violated

3. Ho w close the accurn ulated fa u Its q n meet it\a tyhoelqutinegjn nejnet s 

4. Which components are faulty so that they can be repaired

We established in Section 410 .d bhwa h nfa at si k si id g) e s have a perform ance be 
introduced in Section 2.1.Isom e com pmd idnegsmt ® cb eel i sm laa yt a dl (fn® m the 
network. In such cases it is important to determine when isolatio
guarantee that once isolation has occurred the isolated processor(s
uncontrolled way.

Even though we can operate oblivious of the detailed nature of faul
have reason to be concerned about where faults have occurred and
on system performance. We want the ability to:

1. Identify faulty components and interconnect

2. Reconfigure the network to mask known faults

Identifying faulty components and interconnect is useful in several
necessary for the system to reconfigure itself to avoid the faulty compo
determine which units need physical replacement when the system
Having an estimate of the faults in the network is also necessary to ide

1 Portions of this discussion were first presented as [DeH92]




how likely isolation will occur when new faults arise. We have alread
allows improved performance in the presence of faults. Reconfigurat
for facilitating on-line repair.

In order for this identification and reconfiguration to work reliab
large-scale multiprocessor, it must function:

1. Without human intervention

2. Without taking the entire machine out of service

In norm al operation,w e expect the system com pre saicn Ig (fauurl hr u 11 i p r c 
some time after it occurs and reconfigure around it. We expect this to
continuouslyw hilethe system rlit inpsr.cFo e t & Dge s-s vc la He hr m a y h a ve fr e qu e n 
o c c u r r e aigc eMT(TF o f 6 m in u te s as in Section 2.5), it is i n e ffic ient and oftei 
to bring the entire machine out of normal operation in order to pe
reconfiguration. Further, with networks of this size and potential faul
manage the reconfiguration reliably, much less economically.

Finally, we expect our fault identification and reconfiguration to be
are not robust against faults, they may well be useless when actually
liability to system integrity.

5.2 Scan-Based Testing and Reconfiguration 

As introduced in Section 2.2, t h ealEEEes st a p <b artriri nt d a tr-y scan architecture 
[Com90] is emerging as an industry standard for component and boar
overhead, the standard allows functional testing of components and s
Additionally, the standard allows component specific registers whi
component configuration.

However, the IEEE standard TAP architecture has drawbacks which m
in large-scale, fault-tolerant systems. The singular and serial nature
critical, single point of failure in the test system. Architects are force
serial scan chains or to use many short scan chains. The former allo
affect a large number of components while the latter requires signific
many scan paths. Furthermore, the standard TAP architecture integra
small portions of the system into diagnostic mode while leaving the
normal operation. As noted in the previous section, it is inefficient t
from normal operation for fault diagnosis.

5.3 Robust and Fine-Grained Scan Techniques 

In this section, we present three simple additions to standard sca
t e c h n i qu e s t o be u t i 1 i ze d e ffe c t i ve 1 y i n a fa u 11 -t o 1 e r a n ti suect t idr g.rTdr: e basic 

1. Mu 11 i -TAP scan a re ha heccht up MBemn—e n t i s gilrtd p lm am; £te-ss ports allow ing 
the com p o naecnctet eshed from anyofseveralscan paths. 





Figure 5.1: Mesh of Gridded Scan Paths


2. Po r t -b y-p o r t select porla o «eaa cbmponentcan be independentlydis 
Se c t i o n 4.9.1). 

3. Pa r t i a 1 -e xt e r n a 1 s javst q a-ne locehs canned in boundary-testm ode indepe 
operation of other ports on the same component.

With these additions, a scan-based diagnostic and reconfiguration sch
robust, minimally intrusive fault localization and repair.

5.3.1 Multi-TAP 

Su p p o r t i itg pnl oi 4 aa s: te- s s ports on apsim ^het asm sim pie extension oft 
redundant resource and inter clbiip dee tat sitet-e a 55 . oWitfsho am eun tris scan 
c a p a b i 1 i t i eascccaens heed throu^hitpileyps fern ialscan paths. Tip bsnad hot w s t h e c 1 
to be tested and reconfigured even when there are faults along some o
m u 1 1 i p 1 e TAPs o n a ps bmgi aiC,csman paths can be arranged so that a m ini 
com ponents are severed from thdtsp hen stce as li spya ttda nfa b . 1 F 0 r instance 
can arrange the scan paths in a system with dual-TAP components su
are on the same pair of scan paths. This guarantees that two faulty sc
one com pa) aiceenstsiirb le.Figure 5 . 1 sh(D\p< 3 lao gjyixd dh a <d ht h a s this property. 
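
The gridded arrangement can be checked with a short sketch. It assumes, as an illustration, that each dual-TAP component at grid position (r, c) is attached to one row scan path and one column scan path; the path names and helper functions are hypothetical. Under that assumption no two components share the same pair of paths, so any two faulty scan paths cut off at most one component.

```python
from itertools import combinations

def grid_scan_assignment(rows, cols):
    """Attach component (r, c) to row path r and column path c."""
    return {(r, c): (f"row{r}", f"col{c}")
            for r in range(rows) for c in range(cols)}

def inaccessible(assignment, faulty_paths):
    """Components whose scan paths are all faulty."""
    return [comp for comp, paths in assignment.items()
            if all(p in faulty_paths for p in paths)]

def worst_case_loss(assignment, num_faults=2):
    """Largest number of components cut off by any num_faults faulty paths."""
    paths = sorted({p for pair in assignment.values() for p in pair})
    return max(len(inaccessible(assignment, set(faulty)))
               for faulty in combinations(paths, num_faults))

mesh = grid_scan_assignment(rows=3, cols=4)
loss = worst_case_loss(mesh)
```

For contrast, placing every component on the same two scan paths would make all of them inaccessible after the same two path faults.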

Wh en addingredu ancdcaens ts stco pao cnoemt,there are several issues w hich 
addressed to assure us that we can realize the potential benefits of hi





address the issue of resource contention between the scan paths,
cannot both perform a boundary scan through the same componen
alw ays have the abilit y tpoocnoennttr's fcaaciopiaths from a non-faultypath.Thi 
must be able to minimize or eliminate any potential for interferenc
can achieve these goals using two simple techniques:

1. Resource conflict resolution in favor of the most recent requester

2. Sparse encoding of scan instructions

Conflict Resolution 

Presumably, access to the scan paths is being coordinated at som
everything is working properly, there should never be a resource co
However, we are concerned with assuring that reasonable behavior
the system are not behaving properly. We give each TAP its own instruc
register. These registers behave exactly as in a standard TAP [Com90]. Differ
arise w hen m ultiple TAPk a S he tih q) t 4 on e scan registers.This w o u Id oc< 
the different TAPs attempted to load instructions that referenced the s
simple conflict resolution scheme is to give the TAP loading an instruc
the path. When the new instruction is loaded, the instruction in any
bypass instruction. Since each TAP has its own bypass register, there w
to the bypass register. Assuming we can sufficiently minimize the chan
can successfully load a non-bypass instruction into its instruction re
fault-tolerance criterion. The scheme allows a non-faulty scan path
resources away from a faulty scan path. Figure 5.2 shows a possible a
with two test-access ports.
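
The most-recent-requester rule can be sketched as follows. `MultiTapComponent` and its instruction-to-register table are hypothetical; `EXTEST` and `SAMPLE` are standard boundary-scan instruction names, while `CONFIG` stands in for a component-specific instruction and is an assumption. The sketch models only the arbitration: loading an instruction on one TAP forces any other TAP holding an instruction for the same shared register back to bypass.

```python
BYPASS = "BYPASS"

class MultiTapComponent:
    """Hypothetical model of instruction arbitration among several TAPs."""

    def __init__(self, num_taps, register_of):
        # Every TAP starts in bypass, which uses its own private register.
        self.instr = [BYPASS] * num_taps
        # Maps an instruction name to the shared register it selects.
        self.register_of = register_of

    def load(self, tap, instruction):
        """Load an instruction on one TAP, resolving conflicts in its favor."""
        target = self.register_of.get(instruction)
        if target is not None:
            for other, held in enumerate(self.instr):
                if other != tap and self.register_of.get(held) == target:
                    self.instr[other] = BYPASS  # older requester loses
        self.instr[tap] = instruction

registers = {"EXTEST": "boundary", "SAMPLE": "boundary", "CONFIG": "config"}
comp = MultiTapComponent(num_taps=2, register_of=registers)
comp.load(0, "EXTEST")   # TAP 0 takes the boundary register
comp.load(1, "SAMPLE")   # TAP 1 asks for the same register and wins
state_after_conflict = list(comp.instr)
comp.load(0, "CONFIG")   # different register, so no conflict
state_after_config = list(comp.instr)
```

Because a displaced TAP falls back to its own bypass register, the fallback itself can never create a new conflict.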

Sparse Scan Instruction Encoding 

The IEEE TAP protocol for loading instructions is sufficiently involved
scan path from successfully loading an instruction in most cases. How
guarantee that faultybehavior w i 11 n o t i n t e r fe r ep w> n tdin rt .o Si nfa pldiyacce 
fa u 11 s , s u c h as s t u c k -a t fa ickt) sa ® m ttahsil l|icn cels |v ill p re ve n t a p a th fro m be 
able to load an instruction.Astuck -at fa ultin the da TIM, lrbfl))e s ordata-pt 
will force the downstream component TAPs to see all zeros or ones, m
the data lines to cause instructions with all zeros or ones to be loade
not the only kind of fault our system must contend with. Sparse instr
way to make the likelihood that a faulty path can load a valid instruc

The basic idea in sparse encoding is to make the number of legal en
to the number of possible encodings. The non-legal instruction enco
instructions so that they cannot interfere with the normal operati
correcting and detecting codes in common use for data storage and
are common examples of sparse encodings. In this application, we
errors and preventing them from corrupting non-faulty operation, n





Each scan path has its own instruction register, bypass register, a
Each of these is identical to their counterparts in a single-
The primary difference in the scan architecture as a result of the
the instruction decode and conflict resolution unit which repl
decode unit in a single-TAP component.

Figure 5.2: Scan Architecture for Dual-TAP Com

exam p 1 e , w e used a sim pie instruction enco xfcdbni g sc thhe ec ih seu \m h a <n h c o : 
a nn-b it data w ord,the space ofpossi iws h retire taishhw o p d k e so2f 1 e ga 1 

instruction d 1 o Mwseis s2su m e thatthe clock and m ode bits behave in e 
manner to load in an instruction, but that the data lines hold rando
code word getting loaded are:

\[ P_{\text{random-load}} = \frac{\text{number of legal codes}}{\text{number of possible codes}} \]

Of course, when choosing a checksum, one should make sure that t
words are not legal, checksummed instruction encodings.
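
To make the ratio of legal to possible codes concrete, the sketch below uses an illustrative encoding that is an assumption rather than the scheme from this report: an 8-bit opcode followed by an 8-bit check field holding the bitwise complement of the opcode. With that choice the all-zeros and all-ones words are both illegal, and the chance that uniformly random data forms a legal instruction is the number of legal codes divided by the number of possible codes.

```python
from fractions import Fraction

OP_BITS = 8                      # assumed opcode width
MASK = (1 << OP_BITS) - 1

def encode(op):
    """Instruction word = opcode followed by its bitwise complement."""
    return (op << OP_BITS) | (~op & MASK)

def is_legal(word):
    """A word is legal only if the check field matches the opcode."""
    op, check = word >> OP_BITS, word & MASK
    return check == (~op & MASK)

def random_load_probability():
    """Legal codes divided by possible codes, by exhaustive count."""
    total = 1 << (2 * OP_BITS)
    legal = sum(is_legal(w) for w in range(total))
    return Fraction(legal, total)

p_random_load = random_load_probability()
```

Widening the check field relative to the opcode lowers this probability further, at the cost of longer instruction registers.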

Me Hu gh and Wh etsel propose addingpari fiyiT§0]itnos i d re <n ttiicfync ei-c o d in g 
rupted instruction w o rd s . Sp arse e n c o d in gis a m oregeneralencodin 









protection against data corruption.


Costs 

Review ingthedual -TAPe xam pie show n in Fiigfui oen fa2l w oesst s a 6h a t tifca i eadd i 
w i t h a m u 11 i -fTAPic ® ml are: 

• Four additional i/o pins per additional TAP

• One additional instruction and bypass register per additional TAP

• On e addition Mtatpuetjp ardt d i t i o n a 1 TAP 

• One conflict resolution unit

• Ad d i t i o n a llttiJXleps ufd reach shared registerpath 

Form ostrn odern com p o n en ts ,th e lim ite d2r7el^ roaut hcea' iiM Mm pins (Se e Se 
area. As such, the additional i/o pins will generally be the first order c
TAP controller. Note that the size of the conflict resolution unit is prop
the number of potentially shared resources and the number of TAPs.

Compatibility 

As noted above, in the fault-free case, if both scan paths through a co
access the sja mnee cn ct me gi s t leti ,-TAftc npmnentw illbehave identicallyto a st 
sin gl e -TAP com p o hte -TAP dVfpi nanents pladteicamadcbchrden on the softw are 
assure that the scan paths through a given component never attemp
In the faulty case, as long as there is a non-faulty path through a comp
can be used as a standard TAP as long as the faulty path does not ma
in s tru c tio n . As ta n d a rd single -TAP com ponentm aybeiita eTAPi n a s ys t e m 
components, but the single-TAP component is susceptible to any faul
control lines.

5.3.2 Port-by-Port Selection 

Section 4.9.1 introduced the idea of port deselection for fault-masking,
disabling a port in this manner imply that the component will ignor
in which the port is disabled. This means the component will not
the disabled port, and the component will always choose to avoid th
service.From the scan path.portselection/deselectipomsesnitm plyacc 
configuration register.




5.3.3 Partial-External Scan 

Once we have a way to selectively remove some ports on a compon
it m akes sense to be able to p e r fo r np a o s m t © sit ian g o nt ebayepho cr b tana s i s . Th 
capability gives us finer granularity control over the scan paths allow
on subsets of the system while the rest of the system remains in oper

To supportpartial-external scan,the comt ipoamad n it a ter o fit 1 item fcs aari A le di 
at selectingthe appropriate subset ofthe noilh® nMkJiecs ui n dt h Ey-s can pa 
b o u n d a r y-s can padfesvs alrlybt e bypassthe portions ofthe norm albound 
not being scanned during a particular partial-scan operation.

5.4 Fault Identification 

Wh i 1 e d e s cMRtbi nn^p c tio n 4.7.3, w e n ® rt ei ob <t h iaot ncs could failw hen the p n 
appeared to be violated, when the destination rejected the incoming
tion was blocked. Any of these cases could be indicative of faults in th
even in the absence of faults and is the only case which does not nec
sort has occurred. Nonetheless, some faults may manifest themselve
the router checksums, if checked, provide another indication of wh
occurred.

Th e fa u 11 i d e n t i tie a t i o MRp imo tvhd g <sk hi y w illgive us an indication ofw h 
ha ve occurred and m ayprovide enough clues to narrow the search s p 
Ho w e witP.'s fault identification generallydoes not tell us exactlyw hat 
blocking may be due to faults or due to high congestion. Corrupted
infrequent, transient faults or due to static faults. It is often useful t
it makes sense to reconfigure to mask out static faults but it may no
infrequent, transient faults. Since the checksum data travels back al
always possible that any corruption is inserted by a router between t
source of the data. A router closer to the network input could corrup
of a router further away. Similarly, any router could be responsible f
from the destination. For these reasons, we need a separate mechan
of each suspect element in the network.

In the most naive case, we could move the entire system into test
boundaryand intdJrirt hel ss Coa tie katet h e i n t eqgmilijeocftei ®enr y ai d everycom pone 
In this manner, all structural faults in the interconnection can be
com ponentfaults m atchin gM.iljrc ran no Id e Id (Set e t h® in e d . Real faultyw ires 
components can be differentiated from transient faults and congeste
false fault theories.

However, if the system is large, the impact of removing the entire syste
for testing can be significant. The larger the system, the higher the rate
and the larger the amount of hardware that must be removed from
sufficiently large systems, it is often neither economical nor practical
from service.

With the additionalsupport fl.jh weff iebaend ni na feec t hoent estingsigni tie anti; 




intrusive. The addition of port-by-port selection and partial-external sc
of scan testing. At a given time, we can isolate a minimal subset of th
faulty and perform functional and scan testing. By disabling the ports o
to a physical set of wires and performing scan tests on just those por
can quickly determine the integrity of the interconnect in question. Sin
on components connected to a given component, we can isolate the
from the network to perform functional testing on that single compo
the system may continue normal operation while testing occurs.

This scheme provides a capability for fault-identification and local
intrusive. The information gained from this scan testing provides det
nature and extent of suspected faults. With this information, the sy
position to diagnose the extent of faults, perform reconfiguration to
risks associated with continued operation.

5.5 Reconfiguration 

5.5.1 Fault Masking 

When faulty components or interconnections are identified, the fau
figuring the system to avoid the faulty component. Again, the scan-based
inter fa ce to this recon figu ration. Thptabe In tt^st aidagabfa 5)00 tmi ntroduce 
Sections 4.9.1 and 5.3.2, provides one effective means of fault avoidance. If
leaving every port on every component connected to the faulty compo
remove the unit from the functional portion of the system so that it c
operation. Si m ilarly, iffa u Its o c c u r in the w ire so, d nie^c itiocnrcrla a n inve i g o J 
disabling the port on all affected components will effectively excise th
system.

This mechanism of disabling individual ports works effectively for r
the same reasons it was necessary for fine-grained diagnosis. Our mul
to function correctly as long as there is at least one enabled, non-faul
nodes in the network which need to communicate. The semantics o
component will ignore the port throughout the time in which the po

5.5.2 Propagating Reconfiguration 

We should note that reconfiguration, both when excising faults and
for testing, is best performed accounting for the network structure,
we end up removing all of the backward ports in the same logical d
router becomes a dead-end for any connections which are routed through it dest
direction which has no enabled backward ports. We may wish to e
connections from the network, as well. That is, we use the same bas
in Fi gu r e 3.28 to deter man efawu h tijcrlo liters should be excised alongw ith th 
to maintain good connectivity in the network. If we do not reconfigur
impact falls entirely on performance. As long as paths still exist betw




the routing protocol will continue to find the paths through the netw
reconfiguration, it is possible for a router to select a backward port
cannot route the data stream any further towards its destination. Pro
this manner, prevents a router from ever routing a data stream into si
also worthwhile to note that the impact of not reconfiguring in this m
path reclam ation (Se di tpi <p kd i4tlfe2d i sFhgu re 5.3 show s an exam pie w here it 
sense to propagate the reconfiguration and mask additional routing e
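
One way to picture propagated masking is as a fixed-point computation over the network graph. The sketch below is a conservative illustration, not the algorithm of Figure 3.28, which is not reproduced in this section: the `links` dictionary, the router names, and the rule "mask a router once some logical direction has no unmasked successor" are all assumptions made for the example.

```python
def propagate_masking(links, faulty):
    """Return the set of routers to mask, given the initially faulty ones.

    links: router -> {logical direction: [downstream routers or endpoints]}
    A router is masked when, for some direction, every downstream element
    is already masked, since it would be a dead end for that direction.
    """
    masked = set(faulty)
    changed = True
    while changed:                      # repeat until nothing new is masked
        changed = False
        for router, directions in links.items():
            if router in masked:
                continue
            if any(all(n in masked for n in successors)
                   for successors in directions.values()):
                masked.add(router)
                changed = True
    return masked

# Two-stage example: a0 has two ways to reach direction 0, a1 has one.
links = {
    "a0": {0: ["b0", "b1"], 1: ["b2", "b3"]},
    "a1": {0: ["b0"], 1: ["b2", "b3"]},
    "b0": {0: ["out0"], 1: ["out1"]},
    "b1": {0: ["out0"], 1: ["out1"]},
    "b2": {0: ["out2"], 1: ["out3"]},
    "b3": {0: ["out2"], 1: ["out3"]},
}
masked = propagate_masking(links, faulty={"b0"})
```

In this example the loss of b0 leaves a1 with no route in direction 0, so a1 is masked as well, while a0 survives through b1. A less conservative policy would leave a1 enabled and accept the dead-end connections as a performance cost.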

We can also note that it is not always possible to propagate back and
in the conservative m annersu gge s t e d ai bio nea\k ni bhdoeustfrsocni atth ie gpaadtok o r 
Wh en the recon figu ration algorithm showdru icne Hi gunr Sei2t8i w na S.5.&, tilt 
was noted that this reconfiguration was intended for the harvest cas
configure some nodes out of the network. In the interest of providin
remaining nodes, the algorithm may end up isolating some nodes f
network. If we do not wish to isolate nodes from the network, we mi
does not leave som e nodes disconnected from the intedwaolrk before 
routers suggested by the reconfiguration propagation. Figure 5.4 shows
still exist betwnd epno aiiltsein the netw ork and the a l^S8rwt bmlcbhow n in 
suggest masking components in a manner which isolates nodes from

5.5.3 Internal Router Sparing 

If a routing component provides sparing within itself, the scan m
reconfigure the unit to swap spares within the router. For some i/o l
crossbar routers, there may be plenty of additional room for functio
size is dictated by the pin-limited i/o. In these cases, it may make se
structures on the component. Faults inside some part of the routing e
by reconfiguring the component to use an alternate portion of the rou

5.6 On-Line Repair 

The com bination ofacc urate ufapul fctdl ovc iatiMld^ttaooapbecr fo r m r e c o n figu - 
ration,allow s us to realize system s w here the fault-repair loop can 
mechanical intervention, at least up to the fault level provided by thi
grams monitoring the system integrity are empowered to test theorie
the system to best mask the effects of failures. Further, with a knowle
ments necessary for complete system operation along with an accu
the machine, programs can assess overall system integrity.

When outside intervention is necessary to repair the system, the s
disablingand port-based scan allow forin-operationp loenpel a tse m e n t . If 
into a physically replaceable subsystem are disabled, itis possible to 
w ithoutanyfurther interruption ofsystem ope ration. Of course,the ele 
ofthe system m ustalso be suita belg. fBamld leenr el^oln eSi anp ecnotnh putersyster 
[An d 85], St rat us fa u It -tolerant com puter system s [We b 90], Th i n k i n g Ma c ] 
On ce replaced, scan testingcanodiaitescrhioinea hhde fm m tee trim nal integrityo 


Shown above is another reconfiguration of the network first shown in
network has similar dead-end path problem to the one shown in Figu
cannot reconfigure this network to avoid the dead-end paths without i
the network.

Shown above is the result of propagating reconfiguration to avoid havin
As suggested, this propagation results in the isolation of node 6 from the

Figure 5.4: Propagating Reconfiguration Example


rep laceplocro am t . Wh e na teheemr ept is properlyin stalled and identified as 
disabled ports into the replaced subsystem can be re-enabled allow
full service.

5.7 High-Level Fault and Repair Management 

For several reasons, it is desirable to coordinate testing and reconfiguration
Particularly, some central coordination is required for:

• Network Integrity Assessment

• Reconfiguration Planning

• Coordinating scan paths

• Collecting fault data from multiple sources

With integrity assessment, we want to determine how safe it is to c
we want to know the likelihood that we will sustain a fault in the ne
machine inoperational or require serious reconfiguration. In a yield b
we simply want to know the likelihood we will retain complete con
a harvest fault-tolerance environment, we may want to know the like
in a period of time. If we combine the knowledge of the network to
known faults, and a model of fault-occurrences, we can make these

Integrity assessments can be used for several purposes. In the simp
allows human operators to assess the danger level associated with
case, it could be used to schedule down-time for physical repair. Th
feedback into the run-time system and tune operation accordingly. F
of complete network failure increases, a system may want to check
frequently to minimize the impact of the failure. In a harvest situation
certain danger level at which it begins to evacuate the computation
node. If a node can be evacuated before it becomes isolated, the cost
node situation can be avoided. If a danger level can be chosen such
evacuated before isolation, we may be able to get away with simpler
isolation. Simpler strategies generally require less hardware support
operation.

Actual reconfiguration based on the results of integrity assessmen
level coordination. As noted, a central notion of faults is necessary to
reconfiguration as described in Section 5.5.

As noted in Sectiolrt i5IAlR me an paths w ill re qu ire central control to a 
resource conflicts. High-level central control is necessary to utilize th
reconfiguration. For large systems, we would like to avoid a single-poi
havinga singlepntrifcy bel e for allscan paths. The fact Che aetsw 'ei ii a <vea ne d u n d 
paths to anycorn ponentin the netw ork,allow s us to distribute the ( 
tolerate the failure ofscan controllers in the sam e w ayw e tolerate tl 



we allow the scan path controls to be distributed, their action will r
well.

When deciding what faults are worth pursuing, it can be useful to
multiple nodes. For instance, when pursuing theories about a faulty netw
connect, it may be insightful to integrate fault data from several nodes
or interconnect in question.

From a fault tolerance perspective, we want to distribute the high-
The easiest way to manage this is to choose some subset of the proc
perhaps all of them, and assign them the high-level tasks of coordinating testing and reconfiguration.
The control of scan paths could be distributed evenly amon
on the application, we could either dedicate a set of nodes to perfo
simply make this part of the work performed by every processor in th
to interconnect these coordinators as well as connecting them to the

The coordinating nodes will want to maintain a replicated, distri
festations, system integrity, and current system configuration. Replica
of the coordinating nodes may fail or be isolated from the network. T
depend on the fault-tolerance requirements for each particular syste
identified, this information needs to be shared among the coordinati

5.8 Summary 

In this chapter, we examined the integration of test and reconfigur
tolerant networks. We began by reviewing the motivation for fault local
addition of multiple TAPs, port deselection, and partial-external scan to
testing approaches, we developed robust mechanisms to support f
reconfiguration. We described how these mechanisms allow scan-bas
to proceed in a minimally intrusive manner, allowing the portion o
or reconfigured to continue normal operation. We also summarize
facilitates detailed fault identification, system reconfiguration, integ
repair. Finally, we sketched how the control and data management
can be integrated with the multiprocessing system, itself.

5.9 Areas To Explore 

In Se c t i o n 5.3.1, w cb ui n fcrtili iet ^lla ach <p gd m e n t t o e xilst iplie mcia n paths. 

The m ultipath scan a b d> lpi tpy giivfeus mist y hoe w ire the scan paths in aw ayw h 
the effects o f a n y fa u 11 s . Th a t i s , fo r a given num berofscan paths,w e ca 
ofcom ponents to scan paths w hich isoleafiehtflaeiFthe ^es It. (D<6 am p o snee, n t 
the scan paths are useful to us only because they allow us to test and
underlying network. The redundancy structure of the network shou
when selecting an assignment of components to scan paths. For ins
be wise to assign all the components in a single stage of the netwo
Consequently, we also want to optimize the assignment of componen
which minimizes the effects which scan-path faults may have on net
w e ass urn e a com po naecnctevs idibclhe ivs an set a n paths is faulty, w e w antto r 
effects ofscan-path introduced fa u Its on netw ork connectivity. Consi 
ofourscan paths,w e litniQ g/p hhya it ceaxj) 11® calityw hen w iringscan paths 
Bykeepingscan path connections physicallylocal,w e keep the costi 
down and keep the reliab nlmt^ootf h h gh i nWh lecn looking for good assign 
ofcom ponents to scan paths, w e w o u Id also like to d bstpy.1 o i t a 1 a r ge 
Determining reliable and practical assignments of scan paths onto
redundancy characteristics remains an interesting problem to study



6. Signalling Technology


In the previous chapters we focussed on the architectural and organ
robust, low-latency communications. In this chapter and the next
physical aspects of high-performance network design. In Chapter 4 w
protocol at the data link and netw ork levels (Se MRFicgo u d ch. fe)xivs 6 p o i n t e 
on top ofanynum her ofphysical transport layers. He r e , w e address 
practical and eco hlcbimg dcis t b pi ^rijap dotssuch ro u tin gp ro to c o Is w ith m 
transit latency. 

6.1 Signalling Problem 

Ou r prim arygoalin selectinga sign allin gdisc ip linpasi tar t s a n s m i t c 
w ith m inirn urn latencyw hile affordinghigh W t qannddwt d ds tch mTh e Xtrea m ts i t la 
the com ponentinpu t{ 0 pvu ti ]b ludt d p teenndc p, n the design ofour signallingdi 
well as the basic technologies available for implementation. Since w
computer networks with inter-router delays which grow as the syst
with signalling over potentially long interconnect. As a consequence, our interc
medium will behave like a transmission line, and our signalling pro
line design problem.

6.2 Transmission Line Review 

In this section,w e review the salient features of tWD84J fiorission line i 
a thorough treatment of transmission lines.

For most physical interconnection media in use in digital systems
ation and phase distortion can be ignored. If we further ignore the ba
there are tw o prim arych arac teristic s w hich des cimpbdamse transrn i s s i < 
a n [Propagation velocity. 

The propagation velocity, $v$, characterizes the speed at which electrical wa
interconnect. When the voltage across a transmission line changes a
propagates down the transmission line at $v$. The propagation velocity is determined
materials in the interconnect and is given by Equation 6.1.

$$v = \frac{1}{\sqrt{\epsilon\mu}} \tag{6.1}$$

For most materials $\mu \approx \mu_0$, where $\mu_0$ is the permeability of free space. For conventional printed-circuit
boards (PCBs) $\epsilon \approx 4\epsilon_0$, where $\epsilon_0$ is the dielectric constant of free space.

$$c = \frac{1}{\sqrt{\mu_0\epsilon_0}}$$

where $c$ is the speed of light in a vacuum. Standard PCBs, thus, have a propagation velocity of about half
the speed of light.
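
As a quick numerical check of Equation 6.1, the sketch below evaluates the propagation velocity relative to the speed of light. It assumes a relative dielectric constant of 4 for a conventional PCB and a relative permeability of 1; the function name `propagation_velocity` is illustrative and does not come from the report.

```python
import math

C = 299_792_458.0  # speed of light in a vacuum, in m/s


def propagation_velocity(rel_permittivity, rel_permeability=1.0):
    """Equation 6.1 rewritten relative to c = 1/sqrt(eps0 * mu0)."""
    return C / math.sqrt(rel_permittivity * rel_permeability)


# Assumption: a conventional PCB dielectric with eps ~ 4 * eps0.
v_pcb = propagation_velocity(4.0)
print(f"v = {v_pcb:.3e} m/s, or {v_pcb / C:.2f} c")
```

Under these assumptions a signal covers roughly 15 cm of PCB trace per nanosecond, which is why inter-router wires must be treated as transmission lines.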

The characteristic transmission line impedance arises from the d
pacitance between the signal conductor and its associated ground
of the interconnect is fairly constant, the distributed inductance an
as well. The distributed inductance and capacitance give rise to a ch
impedance, $Z_0$. While transient voltage waves are propagating down the segm
1 i n £9 d e fin e s the transient voltage to t vZpA ss he fill tnocut iff m not frtdrtd oc.onductoi 
georn etryand the dielectric nr aterialsinvolved.Fornr osteon ventiona 
connect ge o as 5© he l^Q. Zp = 50£2 is a conventionalstandard for PCB in te re 

and high-performance cabling.

As long as the interconnect presents a uniform characteristic imp
wave propagates along the interconnect. Real interconnect, however, is not infinitely
long, so we must consider what happens to the signal when it reaches
line. To understand what happens, let us consider the more general p
our propagating wave encounters an impedance discontinuity. When
encounters an impedance discontinuity, the discontinuity gives rise
of discontinuity, part of the voltage wave may propagate through the discontinuity
may reflect back to the source. In general, when we have an incident voltage wave, $V_i$,
from a region of interconnect with impedance $Z_0$ into a region with impedance $Z_1$, the reflected and
transmitted voltage waves, $V_r$ and $V_t$, are governed by Equations 6.2 and 6.3.

$$V_r = \left(\frac{Z_1 - Z_0}{Z_1 + Z_0}\right) V_i \tag{6.2}$$

$$V_t = V_r + V_i = \left(\frac{2 Z_1}{Z_1 + Z_0}\right) V_i \tag{6.3}$$

We can see from these equations that, if we want our signals to be tran
points, we need to keep the characteristic impedance of the interco
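
The reflection relations of Equations 6.2 and 6.3 can be sketched in a few lines. The function name `reflect` and the use of `float("inf")` to stand for an open circuit are illustrative choices, not part of the report.

```python
def reflect(v_incident, z0, z1):
    """Return (reflected, transmitted) voltages for a wave passing from Z0 into Z1."""
    if z1 == float("inf"):
        # Open circuit: the full wave reflects and the voltage at the end doubles.
        return v_incident, 2 * v_incident
    gamma = (z1 - z0) / (z1 + z0)      # reflection coefficient of Equation 6.2
    v_reflected = gamma * v_incident
    return v_reflected, v_incident + v_reflected   # Equation 6.3


print(reflect(1.0, 50.0, 50.0))          # matched: no reflection
print(reflect(1.0, 50.0, 0.0))           # short circuit: inverted reflection
print(reflect(1.0, 50.0, float("inf")))  # open circuit: doubled voltage
```

The three calls correspond to the matched, short-circuit, and open-circuit cases; intermediate impedances give reflections between these extremes.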

If transrn it a signal from a driver at one end cofpapw) snfeet e nadr.et h e i ve r 
sign allin gis pomtXtbepdint . Point-to-pointsignallingcan be contrasted to t 
signallingw here th e re c aerieve a'® va nadl 1 e ve r a 1 potential d r i ve r s . Th e p o 
s i gn a 11 i n g c a s e i s usri rah epr Iset n hod . Sindleirtlgbstignaen routers is betw een a 
forward and backward port pair, we will focus the remainder of ou
signalling.

In a point-to-point signalling situation, we generally engineer the w
acteristic impedance for its entire length. The endpoints of the tran
our primary concern. Figures 6.1 through 6.6 show an incident voltage w
reflection scenarios depending on the ratio of the termination resis
impedance. If the end of the transmission line is open circuited (termination impedance much greater than $Z_0$, Figure 6.2),
the reflected voltage wave is the same as the incident wave. The receiver at
thus sees a voltage of $2V_i$ following the arrival of the voltage wave. When the tran
is short-circuited (termination impedance much less than $Z_0$, Figure 6.6), the reflected voltage wave has the same
nr agnitude as the incidentw ave b u teocpeji VEa'ikears p o taard t^. a^lni i w a ve . Wh 


[Figure: voltage versus position along the transmission line]

At time t = 0 there is a voltage transition at the source end of the transmission
line. Shown above is the voltage profile along a transmission line interc
first transit time.

Figure 6.1: Initial Transmission Line Voltage Profile

[Figure: voltage versus position along the transmission line]

Shown above is the voltage profile along a transmission line interconnect
wave shown in Figure 6.1 encounters an open circuit at the far end of the
transmission line. The voltage profile shown is characteristic of the line
transit time across the interconnect.

Figure 6.2: Transmission Line Voltage: Open Circuit Reflection
the termination resistance exactly matches the transmission line im
Figure 6.4), there is no reflected wave. If the termination resistance is sli
the transmission line impedance, the reflection will tend between t
and 6.5). It is important to note that the reflected voltage wave will ret
the transmission line and encounter the same reflection scenario
defined by the impedance seen at the source end of the transmissio
continue to propagate along the transmission line until the line reach
defined by the end points. In this steady state, the voltage level along the wire w
voltage which the wire would possess if the transmission line were
For high-speed signalling, we want to engineer the termination to a
tions. That is, we want the destination end to reach the desired voltage as quickly as p
and remain there. Two common methods for achieving this goal for


107 











® 2V. 

cn 

nJ 


0 L 

Position along Transmission Line 

Show n above is the voltage profile alonga transm ission line interconne 
w a ve show n in Fi gu r e 6.1 e n c o u n t e r s an hi g iZ& r fn ijriXp) adance term inatioi 
the farend ofthe transm ission line.The voltage profile show n is charac 
duringthe second transittim e a i.e. r4 Ss t <h=4).i nterconnect ( 

Fi gu r e 6.3: Transm ission |e c t i o n 



[Figure: voltage versus position along the transmission line]

Shown above is the voltage profile along a transmission line interconnect
wave shown in Figure 6.1 encounters a matched impedance termination
far end of the transmission line. The voltage profile shown is charac
after the first transit time across the interconnect.

Figure 6.4: Transmission Line Voltage: Matched Termination

series termination and parallel termination. Figure 6.7 shows a parallel termination arra
and voltage profiles at both ends of the transmission line when the d
parallel termination, we select the driver so that it can drive the trans
age and select the termination resistance matched to the transmissi
(i.e. equal to $Z_0$). A voltage wave originates at the driver and takes one transmi
to arrive at the receiver. Once the receiver sees the voltage wave, the l
voltage until the next transition occurs. Figure 6.8 shows a series term
is selected to drive the line voltage to one-half of the desired voltage
equal to the line impedance ($Z_{drive} = Z_0$) and the receiver is left open circuited

Here the one-half voltage wave arrives at the destination and reflects
thus sees a full-swing transition when the one-half voltage wave arri

[Figure: voltage versus position along the transmission line]

Shown above is the voltage profile along a transmission line interconnect
wave shown in Figure 6.1 encounters a low impedance termination
far end of the transmission line. The voltage profile shown is characte
during the second transit time across the interconnect.

Figure 6.5: Transmission Line Voltage Reflection

[Figure: voltage versus position along the transmission line]

Shown above is the voltage profile along a transmission line interconnect
wave shown in Figure 6.1 encounters a short circuit at the far end of the
transmission line. The voltage profile shown is characteristic of the line
transit time across the interconnect.

Figure 6.6: Transmission Line Voltage: Short Circuit Reflection

time after the source drives the transmission line. The reflected wave
the transmission line one transit time later or one round-trip transi
original one-half voltage wave. When the reflected wave arrives at the s
$Z_{drive}$ and no further reflections result.
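
The series- and parallel-termination behaviour described above can be illustrated with a small lattice-diagram sketch. It assumes an ideal step at the driver and a lossless line; the function `bounce` and its parameters are illustrative names, not taken from the report.

```python
def bounce(v_step, r_source, z0, r_load, transits=4):
    """Voltages at the (source, destination) ends after each transit time.

    Assumes an ideal step of v_step behind r_source and a lossless line of
    impedance z0 terminated in r_load (float("inf") for an open circuit).
    """
    def gamma(z):
        return 1.0 if z == float("inf") else (z - z0) / (z + z0)

    g_src, g_dst = gamma(r_source), gamma(r_load)
    wave = v_step * z0 / (r_source + z0)   # wave launched by the divider r_source / z0
    v_src, v_dst = wave, 0.0
    history = [(v_src, v_dst)]
    at_destination = True
    for _ in range(transits):
        reflected = (g_dst if at_destination else g_src) * wave
        if at_destination:
            v_dst += wave + reflected      # incident plus reflected wave at the far end
        else:
            v_src += wave + reflected      # returning wave arrives at the source
        wave = reflected
        at_destination = not at_destination
        history.append((v_src, v_dst))
    return history


# Series termination: drive impedance equal to Z0, receiver left open.
print(bounce(1.0, 50.0, 50.0, float("inf")))
# Parallel termination: stiff driver, load matched to Z0.
print(bounce(1.0, 0.0, 50.0, 50.0))
```

In the series case the destination reaches the full swing after one transit time while the source sits at half swing until the reflection returns, matching the description of Figure 6.8.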

6.3 Issues in Transmission Line Signalling 

Now that we have reviewed the key features associated with transm
consider the issues associated with high-speed, point-to-point signa
line design. By using series or parallel termination, we can control th


Shown at top is a parallel-terminated transmission line. Below the tran
voltage profiles seen by the source and destination ends of the transm
driver forces a transition at the source.

Figure 6.7: Parallel Terminated Transmission Line


transmission line so that the destination settles to the desired volta

$$T_t = \frac{l}{v} \tag{6.4}$$

As shown in Equation 6.4, the transit time depends on the length of t
and the rate of signal propagation. From Equation 6.1 we know that the rate of pro
depends on the properties of the materials. For most conventiona
$v \approx \frac{c}{2}$. High-performance substrates with slightly higher propagation spe
and reliability currently limits their use to small, high-end designs.

A key issue to guaranteeing that the destination end of the transmi
desired voltage level in a single transit time is proper termination. In
termination cases, we require a termination which is matched to the
impedance. Process variation in the manufacture of printed-circuit b
plicates the ease with which we can achieve matched termination


Shown at top is a series-terminated transmission line. Below the tran
voltage profiles seen by the source and destination ends of the transm
driver forces a transition.

Figure 6.8: Serial Terminated Transmission Line


w ill only guarantee the im pedance ofthe E£B%i gfi ghl tiffcurcnis nwd fit h i n a b o 
can be specified but a 1 w ays at higher costs. Ad dition ally, there is the 
term ination is fabricated. <Ext airtnrae ls,issut ® iiasc ® mr esistorpacks can be u : 
m ination w ith m oderatelyhigh accuracy. Flo w © \te n ,tt e Dim ip ta Jiimmt fo r a 
such as aroutingcom ponent,can r eR^iBi iTh ias st imrb si ha fi e es a ro tro tchoes t fo r t h 
term ination com ponents.forthe PCB real-estate,and forthe added co 
space required for termination also translates into larger distances
longer transit latency and lower reliability. External, fixed resistors als
to reverse the direction of signal transmission across our transmiss
reverse an open connection in our network.

Another key concern when driving transmission line interconnec
drive the transmission line. The power supplied by the driver is dete


and resistance seen by the driver (Equation 6.5).

$$P_{drive} = \frac{(\Delta V)^2}{R} \tag{6.5}$$

A parallel terminated transmission line will dissipate power as sho
the transmission line to

$$P_{parallel\text{-}drive} = \frac{V^2}{Z_0} \tag{6.6}$$

A series terminated transmission line will dissipate power given by
trip transit time following any transition. Once the reflection returns
dissipated in the steady-state condition.

$$P_{serial\text{-}drive} = \frac{V^2}{2 Z_0} \tag{6.7}$$

Additionally, power is required to charge the capacitance associated
given by Equation 6.8, where $f$ is the frequency at which the driver changes voltages, $\Delta V_{driver}$ is the voltage
swing driving into the final driver, and $C_{driver}$ is the capacitance which must be charged o
to change the voltage on the driver.

$$P_{charge} = \frac{1}{2} C_{driver} (\Delta V_{driver})^2 f \tag{6.8}$$


6.4 Basic Signalling Strategy 

To m eetthe needs ofpoint-to-pointsi gm aclcleipgwbilh p (bgfei ®ep ,ewe d a n d 
a series-terminated, low-voltage swing signalling scheme which us
feedback to match termination and transmission line impedances.
discussion, we focus on CMOS integrated circuit technology.

Low-voltage swing signalling is dictated by the need to drive the res
load with acceptable power dissipation. We see in Equation 6.5 that
swing saves power quadratically. In the designs which follow, we spe
between zero and one-volt. Limiting the voltage swings to one-volt sav
over traditional five-volt signalling ($P_{serial\text{-}drive} = 250$mW with five-volt signal swings and
$P_{serial\text{-}drive} = 10$mW with one-volt swings).
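
The quoted power figures follow directly from Equations 6.5 through 6.7. The sketch below reproduces them, assuming the conventional 50 Ω line impedance; the function names are illustrative and not taken from the report.

```python
def drive_power(v_swing, resistance):
    """Equation 6.5: power delivered into the resistance seen by the driver."""
    return v_swing ** 2 / resistance


def parallel_drive_power(v_swing, z0):
    """Equation 6.6: a parallel-terminated driver sees the line impedance Z0."""
    return drive_power(v_swing, z0)


def series_drive_power(v_swing, z0):
    """Equation 6.7: the supply sees the series termination plus the line, 2 * Z0."""
    return drive_power(v_swing, 2 * z0)


def charge_power(c_driver, dv_driver, frequency):
    """Equation 6.8: power needed to charge and discharge the driver capacitance."""
    return 0.5 * c_driver * dv_driver ** 2 * frequency


Z0 = 50.0  # assumed line impedance in ohms
for swing in (5.0, 1.0):
    print(f"{swing:.0f} V swing: {1e3 * series_drive_power(swing, Z0):.0f} mW")
```

Because the swing enters squared, dropping from five volts to one volt reduces the drive power by a factor of 25.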

To achieve one-volt signalling, one provides a one-volt power supply
purpose of signalling. This frees the individual component from needing to convert betw
logic supply voltage and the signalling level. Any power consumed generating
supply is dissipated in the power supply, and not in the individual ICs

Series termination offers several advantages over parallel termina
nalling. We can integrate the termination impedance into the driver. I
w e needed to drive the transrn ission line uopl p al ger ac ilia Th et oe fffe e ta i/gn alii 
resistance across the driverbetw een the supplyrails and the driven t 
com pared to the transrn is s/fp o imldnr<4eimtp0ddrii\rectda,e transrn ission line 
close to the s iugp p lllyi (iSq> s F6.fi). ih® aCMOS irn plem e n ta tio n ,th is m eans thattl 


Shown here is a CMOS transmission line driver. Shown at right is the basic c

Shown on the left is a simplified model of the driver making explicit t
transistor, when enabled, can be modeled as a resistor of some resista
transistor's W/L ratio and process parameters.

Figure 6.9: CMOS Transmission Line Driver

of the transistors implementing the final driver must be large so that the resistanc
small. As a consequence, the final driver is large and, therefore, has c
pacitance ($C_{driver}$). This means it will take additional time to scale the drive
signal up large enough to drive the final driver. It also means that $P_{charge}$
(Equation 6.8), will be large. In contrast, the series terminated driver ca
driver. The higher impedance of the series terminated driver allows
smaller W/L ratio and hence, less latency driving the output.

The series terminated configuration gives us the opportunity to use
chip, series termination to match the transmission line impedance.
line impedance and the conductance of the drive transistors to vary
monitoring the stable line voltage during the round-trip transit time b
the source end of the transmission line and the arrival of the reflectio
whether the driver termination is high, low, or matched to the trans
a properly terminated series transmission line, we expect the volta
ground and the signalling supply during the first round-trip transit time. If the volta
above the halfway point, the drive impedance is too low. If the voltage
way point, the driver impedance is too high. By monitoring the voltage
the drive impedance appropriately, the integrated circuit can comp
both the silicon processing and PCB manufacture to match the term
impedance. CMOS circuit designers are familiar with the practice of des
compensate for the wide variations associated with silicon process
technique takes the strategy one step further to compensate for var
external environment.
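
The feedback rule described above can be sketched as a simple search over the available drive impedance settings. This is a minimal sketch, assuming an ideal resistive divider between the drive impedance and the line during the first round trip; `sampled_level`, `tune_drive_impedance`, and the list of settings are hypothetical and do not describe the report's circuitry.

```python
def sampled_level(r_drive, z0, v_signal=1.0):
    """Source-end voltage during the first round trip (divider r_drive / z0)."""
    return v_signal * z0 / (r_drive + z0)


def tune_drive_impedance(z0, impedances, v_signal=1.0):
    """Step through ascending impedance settings until the plateau is nearest half swing."""
    half = v_signal / 2
    idx = len(impedances) // 2
    for _ in range(len(impedances)):
        level = sampled_level(impedances[idx], z0, v_signal)
        if level > half and idx + 1 < len(impedances):
            nxt = idx + 1      # plateau above halfway: drive impedance too low
        elif level < half and idx > 0:
            nxt = idx - 1      # plateau below halfway: drive impedance too high
        else:
            break
        # Stop when the next setting would move further from the halfway point.
        if abs(sampled_level(impedances[nxt], z0, v_signal) - half) >= abs(level - half):
            break
        idx = nxt
    return impedances[idx]


settings = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]  # hypothetical settings in ohms
print(tune_drive_impedance(62.0, settings))
```

The loop only ever compares the sampled plateau with the halfway point, which is the single piece of information the text says the driver needs.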


Functionally, we want an adjustable resistance for the path to both the h
rails. To drive the output to a particular signalling rail, we connect the rail to
the output pad via the tuned impedance.

Figure 6.10: Functional View of Controlled Output Impedance Dr


6.5 Driver 

To allow adjustable driver impedance, the output driver is designed
line on the chip's output pad to the signalling supply through a controllable impedan
Logically, this configuration is shown in Figure 6.10. Several options exi
variable output impedance. Knight and Krymm suggest controlling
controlling the gate voltage on the final stage [KK88] (See Figure 6.11). Branson
suggests using exponentially sized pass gates placed ahead of the output pad
[Bra90]. The impedance is controlled by only allowing the appropriat
turn on to achieve the desired impedance. Gabara and Knauer sugges
equivalent using a set of exponentially sized transistors for the pull-up and
pull-down networks to allow digital control of the output impedanc
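
To illustrate how a set of exponentially sized parallel devices gives digital control of the output impedance, the sketch below models each enabled device as a resistor whose conductance is proportional to its width (1W, 2W, 4W, 8W). The 400 Ω unit resistance and the function name are hypothetical values chosen for illustration, not figures from the report.

```python
def network_resistance(code, unit_resistance, bits=4):
    """Resistance of parallel devices of width 1W, 2W, 4W, ... selected by `code`.

    Assumes each device's on-conductance scales linearly with its width, so
    bit i of `code` contributes 2**i / unit_resistance of conductance.
    """
    if not 0 <= code < (1 << bits):
        raise ValueError("code does not fit in the impedance register")
    conductance = sum(
        (1 << i) / unit_resistance for i in range(bits) if (code >> i) & 1
    )
    return float("inf") if conductance == 0 else 1.0 / conductance


# Hypothetical 400 ohm unit device: sweep every setting of a 4-bit register.
for code in range(1, 16):
    print(code, round(network_resistance(code, 400.0), 1))
```

Because conductances add in parallel, a binary code sets the total conductance in uniform steps, giving fine resolution at low impedance and coarse resolution at high impedance.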

DeHon, Knight, and Simon consider a variant that places the imped
and the gating transistor in series between the supply and th [DKS93] (See
Figure 6.13). This configuration achieves lower latency by moving the im
out of the critical signal path through the output pad. Unlike the oth
impedance schemes, the impedance setting is controlled separatel
static during operation. The generation of the pull-down and pull-up
must be performed in the signal path to the final stage of the pad driv
impedance control devices do not change with each data cycle, less
the final stage drivers.

In allofthese driver schem es,w happhjeih agM s vgo a ltlhi n eg ss h o 1 d drop 


Shown above is a voltage-controlled CMOS driver. By
varying $V_{control}$ below the logic supply voltage, one varies the gate voltage appl
final driver when it is enabled. Modulating the gate voltage in this manne
conductance of the final driver and hence the impedance seen by the t

Figure 6.11: CMOS Driver with Voltage Controlled Output Impedance


more below the hi ghN-doegyicc s is pc pi hy,b e used to fo r m the pull-up n e t w o 
the pull-dow n netw ork. Th a t i s , w hen the internal logic can drive tl 
fin al d river m ore than a threshold abo veiitjh p k^.a Is tbreecdodnri g^tsssiagnyia 11 i n g s 
to us ©-da evice pull-up to allow the output to sw ingall the w ay up to t 
NMOS devices have si ze,speed,and pow er advantages. Si ho ai the m obil 
tw o and a halftim es the m N-db a Mteyeo Miidheasgawn transconductance,at 
im pedance,can be roughlytw o and a halftim e-d es ni c ael Ive ii ttlh titnea corr 
same transconductance. The smaller devices present less capacitan
internal logic and hence operate faster while dissipating less powe
driverlayoutbecom es sm allerand s im pier since tshdee Mcadsd. river is b 
Ou t p u t d r i ve r s that R-<& tyvocne feNadtnhdvi c e s r e qu i r e gu a r d -r i ffXgSS la b tiw e e n the 
NMOS devices to protect against latch-up.

Figure 6.14 shows a sized version of the output driver shown in Figu
implemented in CMOS26, Hewlett Packard's 0.8µ effective gate-length process [DKS93]. This
output driver exhibits a 2ns output latency and 1ns rise/fall times.
capable ofm atchingexternali ntQja n dl al^Obln st ds teatlw t h © nd 3(1 ve r c o n s u m e s 


Shown above is a digitally controlled variable resistance driver from
actually NMOS devices in both resistance networks since it focuses on o
a different voltage region). The pu_impedance and pd_impedance
determine which transistors are enabled whenever the driver drives
respectively. The transconductances of the enabled parallel transistors
the transconductance between the signalling rail and the output pad.

Figure 6.12: CMOS Driver with Digitally Controlled Output Impedance

approximately 10mW + 2mW/100MHz of power.

6.6 Receiver 

Th e receiverm ustconvertth enlp» uvt -svbght a^icsw hml^is w inglogic signalfo 
inside the com ponent.In the interest ofhigjh c-sepi eeerdws fey i ctk Hr iansghwi g^i w £ 
gain forsm allsi gn aide viations around thdlinn gdp-jp tbiienst.lfG&ISw e e n the si 
and [KK88] bndt nee suitable d cffeirve m st i aR6.$t5 show s onecsai icvh rr Th e 
r i gh t m ostinverter pair (Hand 12) i n Fi gu r e Ge 1<5 £oi vanr ±> a ads iefid it © nt iiitp 1 w hen 
the input vo 11 a ge seen bjcthasdia^alljllad leoxwp -pd jd hi giensd 12 are identical 
inverters. The inputs are taken through resistors to what would norm
connections of the inverters. The resistor between the pad and I2 is t
resistor w hich m CMQSt iei ip futt cpna ds. The resistor betWiei egtvotlht agjaalfsigna 
level and I1 is an identical resistor for reference. The normal input a
structure are shorted together so that I1 serves as a bias generator pla
ofoperation.Ifthe inputpad voltage connectedltn g 2iw 1 fc ar ge al bsw la, t h a 1 f l 
the two devices would be in identical voltage states and the output of
also be mid-range. As the pad input voltage seen by I2 varies away from


[Figure: driver schematic with pu_impedance and pd_impedance control networks (devices sized 1W, 2W, 4W, 8W) between the signalling rails and the output pad and transmission line]

Shown above is a digital controlled-impedance driver after [DKS93]. The
on pu_impedance and pd_impedance enable the parallel impedance control transis
Driver transistors are placed in series between the impedance control
put pad. The desired signalling level is driven onto the pad by enabling the appropri
drive transistor. The digital impedance controls remain static during n

Figure 6.13: CMOS Driver with Separate Impedance and Logic Controls


I1 and I2 rapidly become unbalanced leading to a high-gain output fro
is slightly above the half voltage level, the switching threshold of I2 bec
supplied by I1. The bias on the gates of I2 devices appears like a low in
drives a high output. Similarly, if the pad input to I2 is slightly below th
switching threshold of I2 becomes lower than the I1 bias causing the
In response, I2 drives a low output. Finally, I3 serves to restore the reco
to a full rail-to-rail voltage swing for driving internal logic. In order for
should be sized so that its midpoint voltage tracks the midpoint volt
variation.

Figure 6.16 shows a version of the receiver shown in Figure 6.15 which was implemented
in CMOS26, Hewlett Packard's 0.8µ effective gate-length process [DKS93]. The latency
through this receiver is approximately 1ns. The total latency through the receiver and the driver
described in the previous section is 3 ns.


[Figure: sized CMOS driver circuit]

All transistors are 0.8 µm long.

Shown here is a sized CMOS driver circuit from [DKS93]. All widths are shown in microns.
This driver was designed for Hewlett Packard's CMOS26 0.8µ effective gate-length process.

Figure 6.14: Controlled Impedance Driver Implementation

6.7 Bidirectional Operation

The drivers and receiver shown in the previous sections can be con
bidirectional signalling, as needed for the routing protocol from Chapter 4. A sing
pad would contain both a driver and a receiver. At any point in time o
line would be configured to drive the line and the other to receive.
both its pull-down and pull-up enables turned off. In this mode, bo
receiver look like high impedance connections to the transmission line. The re
behaves as the high-impedance connection we expect on the destina
transmission line. The driving end of the transmission line drives eit
enable connecting a signalling rail to the transmission line through the adjust
network. When it is necessary to turn the connection around to rev
network, the i/o pads can swap roles as driver and receiver.




Show n a b o ve i s a r 4KK£8]vdir vieft 4 ilras n ffi a r e identica FId/fel wc e s ( 

WP2, WN1 = WN2). 71 b i a s/<2 sn to its h i gh -ga in r e gi o n . Wh e n the vo 11 a ge on the 
inputpad is slightlyhigherorloM ienrgttto d tia tab a' dill i fisri griehfie s 

the vo 11 digs tandardizes t h 12 diourtip ltd ©nf side the com ponent. It should be 
sized to have the sam e vo lllaagB M. idpointas 


Figure 6.15: CMOS Low-voltage Differential Receiver Circuitry

6.8 Automatic Impedance Control 

The drivers described in Section 6.5 all allowed the output imped
section we turn our attention to the task of automatically matching t
attached transmission line impedance. In any suitable scheme, we
whether the termination impedance is high or low and a mechan
inform ation back to upd atttd rt lg.dStiarr tpi t® g) \a ri tehe tfieee d e nxtc lrsdaensd tri b e d 
in this chapter, we can obtain the information necessary and close
a discrete -tim e sam pie register and allow titii g g a n d s ss atm phle vanl ip © sd a l 
through the test-access port (TAP) (Section 2.2 and Chapter 5).

6.8.1 Circuitry 

Figure 6.17 show s the scan architecture fora bit doi net© 11 b ® a tlasn gh aarlcg.a d 
b o u n d a r y-s can eraecghs peard .has an im pedance control re gister and a sa 
impedance register holds the digital impedance setting for the pull-
in impedance control schemes such as the ones shown in Figures
register can be written under scan control through the TAP to configure the pull-
netw ork im pedances.The sam pleregisteris s htocwn no icnc M gi* me n6.llK.cWh e i 
logic value to be driven out of the pad, an enable pulse is fed into the
ripples thro ugh the in verter chain enablingeach nspa unt pal leu ree gi\s he r t o s t 
inverterdelays apart. The digitalinputvalue to the sa mcpe He/erie. gis te r c o i 



All transistors are 0.8 μm long.


Shown here is the differential receiver from [DKS93]. All widths are shown in microns.
This driver was designed for CMOS26, Hewlett Packard's 0.8μ effective gate-length process.

Gr o u n d e d -Pin ve rte r s are u s e <£ d n itte ar d ai ffla e a' it h iaa ri ic ravuas plem entary 
inverters as in Fi gu r e 6.15. Exp erim ental evaluation su gge sts that the geo 
transistors used for the di ffecr u hdt i h fe idea a' gdreim so rd e r to p r o vi d e go o d 
stab ilityto p ro c e ssin gvariatio n . 


Figure 6.16: CMOS Low-voltage, Differential Receiver Implementation


Thisreceiverm aybe the sam e one used forreceivingsignpailst.w hen the 
The keyrequirem entisdhai \tehrege m p d i tre s one logic value w hen the pa 
above the signalling mid-point voltage and the opposite logic value whe
below. Following a transition of the output logic value, the sample reg
of closely spaced time samples of the digital value seen by the receive
read under scan control through the TAP to provide a digital, discrete-t
the output pad. Figure 6.19 shows the combined i/o pad circuitry from
provides access to read the sample register and write the impedanc
of analyzing the data recorded by the sample register and selecting th
off-chip controller.
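
As a rough illustration of how a sample register captures a transition, the sketch below latches a receiver's digital output at evenly spaced times. It is a behavioural model only: the function names, the 0.3 ns inter-sample spacing, and the step-shaped pad waveform are assumptions made for the example and are not taken from the circuit.

```python
def receiver_output(pad_voltage, v_signal=1.0):
    """Digital value seen by the receiver: 1 above the mid-point, else 0."""
    return 1 if pad_voltage > v_signal / 2 else 0


def capture_sample_register(pad_waveform, n_bits=16, sample_delay=0.3, start=0.0):
    """Latch the receiver output at n_bits successive sample times.

    pad_waveform: function mapping time (ns) to pad voltage (V).
    sample_delay: assumed spacing between latch enables, in ns.
    """
    return [receiver_output(pad_waveform(start + i * sample_delay))
            for i in range(n_bits)]


def step_waveform(t, t_step=1.0, v_low=0.4, v_high=0.6):
    """Hypothetical pad voltage that crosses the mid-point at t_step ns."""
    return v_high if t >= t_step else v_low


bits = capture_sample_register(step_waveform)
print("".join(str(b) for b in bits))
```

Reading the resulting bit string from left to right gives the same kind of zeros-then-ones record that the scan controller would read back through the TAP.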

6.8.2 Impedance Selection Problem 

The goal is to set the output impedance to achieve matched series te
line were ideal, the signals had no appreciable rise-time, and the rout
line was definitely longer than our sample register, the impedance sel
voltage at the pad of a matched transmission line would look like Fi
high transition. If the series resistance was a little lower than optima


Figure 6.17: Bidirectional Pad Scan Architecture

A simple sample register is composed of a sequence of latches each e
delays apart. When a transition occurs on the output pad, a short ena
into the sample register. Each sample latch records the value seen by t
was last enabled. After the enable pulse propagates through the sampl
register holds a discrete-time sample of the value seen by the receiver.


Figure 6.18: Sample Register


w ould settle a little above the m id -pnopi im tt (yea lit a gp gptih cb tsrai pi tfh las ir e gi s t e r 
read all ones. Similarly, if the series resistance is a little higher, the lin
trip point and the sample register would read all zeros. To set the pull
t h r o u gh va r i o u s i m p e d ae m c dit tfiaatgiva gs c Ait n figu re the outputim pedance 
force a transition of the output using the scan capabilities. Followin
the value of the sample register, again using the scan TAP. When we find
where the sample register changes from reading all zeros to reading
the appropriate pull-up impedance setting. The same basic operatio
transitions to configure the pull-down impedance.
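
The idealized scan just described can be sketched as follows. The model treats the driver and line as a simple voltage divider, so the register reads all ones exactly when the series resistance is below the line impedance. The function names, the 50 ohm line, and the list of candidate resistances are illustrative assumptions.

```python
def ideal_pad_voltage(r_series, z0=50.0, v_signal=1.0):
    """Initial source-end voltage of an ideal series-terminated line."""
    return v_signal * z0 / (z0 + r_series)


def ideal_sample_register(r_series, z0=50.0, v_signal=1.0, n_bits=16):
    """All ones if the line settles above the mid-point, otherwise all zeros."""
    above = ideal_pad_voltage(r_series, z0, v_signal) > v_signal / 2
    return [1 if above else 0] * n_bits


def scan_for_match(impedances, z0=50.0):
    """Walk settings ordered from highest to lowest series resistance.

    Returns the index of the first setting whose register reads all ones,
    or None if no setting trips the receiver.
    """
    for index, r_series in enumerate(impedances):
        if all(ideal_sample_register(r_series, z0)):
            return index
    return None


candidates = [80.0, 70.0, 60.0, 52.0, 48.0, 40.0]  # hypothetical settings, ohms
match_index = scan_for_match(candidates)
```

In this ideal picture the matched setting lies at the boundary between the last all-zeros reading and the first all-ones reading.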


(Plot axes: voltage, marked at 0.5 V_signal and V_signal; time, marked at 0, T, and 2T.)

Shown above is the voltage at the source-end of an ideal, matched series terminated
transmission line following a low to high transition from the driver.


Figure 6.20: Ideal Source Trans


Unfortunately, there are many non-ideal effects which cannot be ign
look more like the ones shown in Figure 6.21. Since there is a finite rise
real signal, the driver will always require some time to drive the transm
point. The sample register will not contain all ones or all zeros. Due to
CMOS process, the time between subsequent samples in the sample re
a sam pie bitrelative to the outputtransitpoDm mnatjtraijawiifi a hyefmot mThcits t 
variation can easily cause the inter-sample time to vary by as much as
it takes finite time for the signal to get from the input pad to the samp
sample were taken when the output started changing, several sampl
input to the sample register reflects the voltage on the output pad. Pro
this skew betw een the pad volta gee t Fa o|d si itt it ss nvaarry (fr <b hme com ponentti 
com ponent. 

As a result, the sample values returned from an impedance scan
Table 6.1. As the impedance decreases, the line does trip from low t
where the low to high trip occurs becomes earlier as the source im
commensurate with our expectations of a finite rise-time. Eventually,
This is an indication of the number of sample times which elapse b
sam pie register and the arrival ofthe fasteatppudd Thr a he sea tri P m ut tn r b> a lgh t 
ofbit tim es this requires w illvaryfrom com ponent to com ponent d 
settingthe control im pedance re qu Frge sTd hid tv. lfc atm & a dt h h st idfjattha e( b e s t 
im pedance setting for proper series term ination. 


6.8.3 Impedance Selection Algorithm 

A heuristic strategy which works well in practice for selecting a ma
atthe derivative ofthe seqgm Tjk bel <p i6d)fii si (1 center in on the center ofthe 
derivative region. That is,w e look atw heeadhes arm ip si Et iaonnd ota ck re r Ish ien 
deltas between impedance pairs. The search focuses on finding the
the largest deltas between an adjacent pair of impedance values. For
trip-point, the sample register would never trip and above, it would


Impedance
Setting     Sampled Data
0           0000000000000000
1           0000000000000000
2           0000000000000000
3           0000000000000000
4           0000000000000000
5           0000000000000000
6           0000000000011111
7           0000000011111111
8           0000001111111111
9           0000001111111111
a           0000011111111111
b           0000011111111111
c           0000011111111111
d           0000011111111111
e           0000011111111111
f           0000011111111111
10          0000011111111111
11          0000011111111111
12          0000011111111111
13          0000011111111111
14          0000011111111111
15          0000011111111111
16          0000011111111111
17          0000111111111111
18          0000111111111111
19          0000111111111111
1a          0000111111111111
1b          0000111111111111
1c          0000111111111111
1d          0000111111111111
1e          0000111111111111
1f          0000111111111111


Shown above is sample data for a 16-bit sample register. The impedances
to the binary encoding of the enables for 5 exponentially sized impedan
implies all the transistors are disabled, while 1f indicates that all the t
the lowest impedance setting).


Table 6.1: Representative Sample Register Data

Shown above are a series of more realistic depictions of the voltage w
the source end of a series terminated transmission line. At the top, w
impedance situation. The middle diagram shows a case where the
impedance is too large, and the bottom diagram shows a case where
is too small.

Figure 6.21: More Realistic Source Trans

the change occurs would clearly be the point where the largest del
impedance setting.
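
To make the idea of deltas between adjacent impedance settings concrete, the sketch below uses the count of ones in each row of Table 6.1 as a stand-in for the transition position and computes the change between neighbouring settings. Using the count of ones as the measure, and the function names, are assumptions made for this illustration.

```python
def ones_per_setting():
    """Count of ones in the sample register for settings 0x00..0x1f (Table 6.1)."""
    runs = [(0x00, 0x05, 0), (0x06, 0x06, 5), (0x07, 0x07, 8),
            (0x08, 0x09, 10), (0x0a, 0x16, 11), (0x17, 0x1f, 12)]
    counts = []
    for first, last, ones in runs:
        counts.extend([ones] * (last - first + 1))
    return counts


def adjacent_deltas(counts):
    """Change in the count of ones between each adjacent pair of settings."""
    return [abs(b - a) for a, b in zip(counts, counts[1:])]


def largest_delta_pair(counts):
    """Return (setting, setting + 1, delta) for the largest adjacent change."""
    deltas = adjacent_deltas(counts)
    index = max(range(len(deltas)), key=deltas.__getitem__)
    return index, index + 1, deltas[index]


counts = ones_per_setting()
pair = largest_delta_pair(counts)
```

For the representative data, the largest single step falls between settings 5 and 6, with smaller steps spread over the settings that follow.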

Na i ve 1 y, w e could scan t h r o u gha Icoectikt iirmg p a dljaant caed p airs. We could id 
the pair of impedance values between which the largest difference o
impedance settings to configure the impedance network. However, s
strategy can be misled. It is often the case that the difference between
perhaps one or two bit positions. That makes it difficult to decide wh
- e.g. consider the case in which there is a run of five impedance setti
position, followed by five impedance settings all identical, then a diff
and no subsequent differences. A pair oriented algorithm would sele
change is really in the middle of the five impedances which differ by o
In stead,w e use an algorithm; eve h si cvh lljosank a Id 4 is im pedance inter va 


Set Impedance:

1  old_himped ← middle impedance value
2  old_limped ← 0
3  limped ← middle impedance value
4  himped ← 0
5  while the old and current imp
6      old_himped ← himped
7      himped ← find high imped (limped)
8      old_limped ← limped
9      limped ← find low imped (himped)
10 set imped (limped, himped)


Figure 6.22: Impedance Selection Algorithm (Outer Loop)


attempt to zero in on the largest delta. To avoid missing large gaps whi
the recursive search divides the region into pieces and recurses on
space w ith the la rgestgap.Further, since the valtuthncgfltdae sohp apve site im 
a second-ordereffecton the otpitii gi va le i iit epr a dea tihcreo .uegh searching for t h 
and low -im pedance settings until the 6s22 duettia) ill ss tehoen bvea isgec. tFI go r & t h m 
for converging on a pair of impedance values. Figure 6.23 describes the a
in on an impedance value using the heuristic strategy just described.
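
One plausible reading of the inner search can be sketched in a few lines. The sketch narrows a bracketing pair of settings by comparing the change seen across two overlapping two-thirds regions and keeping the region with the larger change. The integer third, the use of the count of ones as the measured quantity, and returning the final bracketing pair are assumptions; the data function simply restates Table 6.1.

```python
def table_ones(setting):
    """Count of ones read for a setting, restating Table 6.1."""
    if setting <= 0x05:
        return 0
    if setting == 0x06:
        return 5
    if setting == 0x07:
        return 8
    if setting <= 0x09:
        return 10
    if setting <= 0x16:
        return 11
    return 12


def find_impedance(measure, lp, hp):
    """Narrow [lp, hp] toward the steepest change in measure(setting).

    measure: function returning a transition measure for a setting.
    lp, hp:  settings bracketing the region where the register changes.
    """
    while hp - lp > 2:
        third = (hp - lp) // 3
        hmid = hp - third
        lmid = lp + third
        low_change = abs(measure(hmid) - measure(lp))    # lower two-thirds
        high_change = abs(measure(hp) - measure(lmid))   # upper two-thirds
        if low_change > high_change:
            hp = hmid
        else:
            lp = lmid
    return lp, hp


bracket = find_impedance(table_ones, 0x05, 0x17)
```

Starting from the last setting with no transition (5) and the first setting that reads the same as the lowest-impedance setting (0x17), the search closes in on the settings around 6, where the sampled data changes most rapidly.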

6.8.4 Register Sizes 

When adapting the impedance control strategy described here to
process, it is important to consider the amount of granularity availa
and the time window covered by the sample register. The number o
and hence the number of bits used by the impedance control registe
impedances to which the pad needs to match, the potential proces
mismatch which is considered negligible. The number of bits in the s
the size ofthe w indow required toi giioarr asile e cfhrdi t tihaetta - hlnpsr ocess corn 
one could determine exactly when to sample the output, only a single bit
would be necessary. However, since processing and operating temp
timing of the input and output circuitry, the number of bits in the sam
such that the window spans all potential timing variations.

6.8.5 Sample Results 

The test component [DKS93] was configured with a 16-bit sample regis
six bits of both pull-up and pull-down impedance control. Figure 6.24
at both ends of a 50Ω transmission line for an impedance selection determ


Find Impedance:

1  lp ← highest impedance value with no transition
2  hp ← lowest impedance setting with same value
        in the sample register as the highest impedance setting
3  while ((hp − lp) > 2)
4      hmid ← hp − ((hp − lp) / 3)
5      lmid ← lp + ((hp − lp) / 3)
6      if transition difference between hmid and lp is greater than
          that between lmid and hp
7          hp ← hmid
8      else
9          lp ← lmid
10 return


This version of the algorithm divides the remaining distance into third
overlapping two-thirds regions. A larger fraction could be used for more
greater selection accuracy, at the expense of slower convergence.


Figure 6.23: Impedance Selection Algorithm (Inner Loop)


the algorithm described above. At the process corner represented b 
sixbits ofim pedance control w ere m ore t h a n Os tur ffio isemi t stsoi ccie a n ly 
1 i n e . Fi gu r e 6.25 show sthe m atchingaclpioe rvesdi foamtdh hsustaomr eaCiocnhm peda 
selection algorithm when fewer control bits are used. Of course, a dif
curve would provide a different impedance resolution and range. Figu
diagrams when the same pad is automatically matched to a 100

6.8.6 Sharing 

One option which may make sense in many situations is to share
perhaps impedance control, between several pads. If a group of pads
physical medium (e.g. a PCB), the external transmission lines they drive will h
same characteristic impedance. If the pads are physically close on th
process variation from pad to pad. In such cases, it makes sense to s
impedance control within the group of pads. In cases where we cann
aboutthe externalim pet did fa e ep, <btsw icbtlddos share a single sam pie regis 
group of physically local pads. Such a shared sample register only nee
to select among the possible input sources.


(Plots: two panels; vertical axis in Volts from -0.50 to 1.50; horizontal axis 2.00 ns per division.)

Shown above is the voltage profile seen at both the driver and receiver e
50Ω transmission line. The impedance matching shown here was deter
using the algorithm detailed in Figures 6.22 and 6.23.


Figure 6.24: Impedance Matching: 6 Control Bits

6.8.7 Temperature Variation 

Temperature is an important environmental factor which affects t
circuit.Tem perature a ffe c t s the tran s (CMOS idnut e gaaitced oc fid e iviict a sn idi ha e n c ( 
the termination impedance. The automatic impedance matching de
to adjust the termination impedance to be matched at the temperat

The process described above would normally be performed as part
for each component. In some environments, it is possible for the compon
widely during operation. As the temperature varies from the point w
place, the termination impedance will deviate from the transmiss
e ffe c t is s i gn i fie a n t e n o u gh to a ffe a 1 i t yntih 9smr d> sustiiai ig p e hitaobc o 1 w ill n o t i 
higher than normal rate of corrupted messages through the compon
system attempts to localize errors (See Chapter 5), it can rerun the
adjust the impedance for the current operating temperature. A mor
taken by integrating a temperature sensor onto the integrated circu
on-chip temperature sensor, the scan controller could periodically
component. When a component's temperature indication differs sign
indication w hen the com ponentw as last ceacla Hi irbartae tie, t h e in apnecdoam ter e 
setting. Us ingthe port-deselection and pmidtdb cyepdo irrt Ch at jp fae <r i5l i hi e simtnr 
controller can isolate individual port pairs and recalibrate their dr
having a significant performance impact.
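
The temperature-triggered recalibration policy can be sketched as a simple polling pass. Everything here is hypothetical scaffolding: the 10 degree threshold, the `read_temp` and `recalibrate` callables, and the dictionary of last-calibration temperatures stand in for whatever the scan controller would actually provide.

```python
def needs_recalibration(current_temp_c, calibrated_temp_c, threshold_c=10.0):
    """True when the temperature has drifted past the assumed threshold."""
    return abs(current_temp_c - calibrated_temp_c) > threshold_c


def recalibration_pass(components, read_temp, recalibrate, threshold_c=10.0):
    """Recalibrate every component whose temperature drifted too far.

    components:  dict mapping component name to temperature at last calibration.
    read_temp:   callable returning the current temperature for a name.
    recalibrate: callable that reruns impedance matching for a name.
    Returns the list of component names that were recalibrated.
    """
    redone = []
    for name, calibrated in list(components.items()):
        now = read_temp(name)
        if needs_recalibration(now, calibrated, threshold_c):
            recalibrate(name)
            components[name] = now  # remember the new calibration point
            redone.append(name)
    return redone


last_calibrated = {"router0": 40.0, "router1": 40.0}
current = {"router0": 42.0, "router1": 65.0}
calls = []
redone = recalibration_pass(last_calibrated, current.get, calls.append)
```

Only the component whose temperature indication moved significantly is taken through the matching procedure again; the others are left undisturbed.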


(Plots: four panels; vertical axis in Volts from -0.50 to 1.50; horizontal axis 2.00 ns per division.)

Shown above are the voltage profiles seen close to the driver and rec
50Ω transmission line following both high and low output transitio
selected automatically. Since only 3 bits of control were used, the
used to simulate parameter variation - in the top pair of traces the
the bottom it was enabled.

Figure 6.25: Impedance Matching: 3 Control Bits

6.9 Matched Delay 

In Section 3.2 we pointed out that pipelining bit transmissions
prevent the transit times across wires in the system from having a
bandwidth and latency. With the circuitry developed in the previo
how to reliablypipeline m ultiple bits o n pt b ra svnitrse. s betw een rou 


e i ve 

n s . T 
h i gh i 
b it w 


o ve r 
t e ga 

Ll S S 
ting 


(Plots: two panels; vertical axis in Volts from -0.50 to 1.75; horizontal axis 2.00 ns per division.)

Shown above is the voltage profile seen at both the driver and receiver e
100Ω transmission line. The impedance was matched automatically. T
voltage level is the result of the finite impedance of the measurement ap


Figure 6.26: 100Ω Impedance Matching: 6 Control Bits


6.9.1 Problem 

The wires interconnecting components in the network vary in len
temperature variations, the exact delay through an input or output p
With arbitrary length wires and uncertain i/o delays, there is no guar
arrives at the destination component relative to the system clock. If
setup time just before the clock rises or during the hold time just a
receiver can clock in indeterminate data. To avoid this potential pr
delay through each output pad to guarantee that the signal transition a
reasonable time with respect to the clock.

6.9.2 Adjustable Delay Pads 

To control the arrival time of signals at the destination, a variable del
the internal logic and the final output driver. This buffer is designed to h
such that it can always move the arrival time of the signal out of the
processing and temperature. For the granularity of control necessa
variable delaybuffer could s im plybeasr one h si ttiqo he ste quperro cvied a rf tgp p s offof 
chain of inverters (See Figure 6.27). For finer control, of course, a voltage c
be used instead of, or in addition to,a variabl 6.2£)e Fa gum eu 6.29p 1 e xo r (Se e 


Figure 6.27: Multiplexor Based Variable Delay Buffer


(Schematic labels: input, VCTRL, gnd, output.)

Shown here is a voltage controlled delay line (VCDL) after [Baz85] and [Joh88
VCTRL effectively controls the amount of capacitance seen by the output of
stage and hence the delay through each inverter stage. The number of s
the VCDL will depend on the range of delays required from the buffer.


Figure 6.28: Voltage Controlled Variable Delay Buffer


shows a revised pad architecture which incorporates the variable
con figu ringthe delaythrough the com ponest’s [TAPrT<h io - <p in id rc^ r tvmr aaim d 
the same as in the matched impedance pads described above.
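
A multiplexor-based variable delay buffer can be modelled as a set of taps along an inverter chain, with the scannable delay register choosing which tap feeds the output driver. The sketch below is a behavioural illustration only; the 100 ps inverter delay, two inverters per tap, and eight taps are assumed values, and the function names are hypothetical.

```python
def tap_delays(n_taps, inverter_delay_ps=100.0, inverters_per_tap=2):
    """Delay available at each tap of the inverter chain, in ps."""
    return [i * inverters_per_tap * inverter_delay_ps for i in range(n_taps)]


def select_tap(target_delay_ps, delays):
    """Index of the tap whose delay is closest to the requested delay."""
    return min(range(len(delays)), key=lambda i: abs(delays[i] - target_delay_ps))


delays = tap_delays(8)
chosen = select_tap(650.0, delays)
```

The granularity of such a buffer is one tap spacing, which is why a voltage controlled delay line may be preferred when finer control is needed.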

6.9.3 Delay Adjustment 

We can use the same basic strategy used for matching impedance
is, by watching the voltage level at the source end of the transmission
round-trip transit time across the transmission line. Since we can
the same time to propagate from the source to the destination as it
the destination to the source, we know that the signal arrived at the
the round-trip transit tim e.Allw e need to do i si tii e> tne a - tfr ti me t\h da e n th 
half-way point to the full signal voltage rail as well as when it transitio

The information which we record in the sample register when sc
impedance values, in effect, already provides us with this information
register. The input receiver is set to trip whenever the voltage on the source en
line exceeds the hal f-w ayp o iiikikg trw id eintihaersrigpa la operation, w e selt 
driver im pedance to m atch the transm ission linecsontlfathtilse sou re 
hal f-w a yp o in t d u ringthe round trip -tim e.If, forexam p 1 e , w e w ere to s 


Shown above is the revised bidirectional pad architecture incorp
buffers in the output path as well as a scannable register to control


Figure 6.29: Adjustable Delay Bidirectional Pad Scan Architecture


three times the characteristic impedance of the transmission lin
midpoint and trip not when the line was driven, but when the r
the far end of the transmission line. That is:

$$V_i = \frac{1}{4} V_{signal}$$

$$V_{R_{dst}} = V_i$$

$$V_{R_{src}} = \frac{1}{2} V_{R_{dst}}$$

For a transit time $t_d$, the pad voltage becomes

$$V = \frac{1}{4} V_{signal}$$

for the period $0 \le t < 2 t_d$. After the reflection returns and reflects against the
termination,

$$V' = V_{R_{src}} + V_{R_{dst}} + V_i = \frac{5}{8} V_{signal}$$

for the period $2 t_d \le t < 4 t_d$. Qualitatively, the situation resembles the case
termination is too large as shown in Figure 6.21. It is not essential for the driver imped
to be $3 Z_0$ as assumed, for this behavior to hold. As long as the driver im

$$Z_0 < Z_{drive} < 4 Z_0$$


o r ; 
t h e 

t u r 

e , t 

e fie 


u n 


w 

a n c 
p e 


Figure 6.30: Sample Register with Selectable Clock Input


the sam pleregisterw ill trip a ft er the a 1 1 lAsvd (b m fgtdrse t heet a x m rpe Iffe c d ig® snt e r 
is sufficiently long, we can determine not only the impedance setting,
and reflected waves occur.
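
The voltage steps in the three-times-impedance example can be checked numerically. The sketch below assumes an ideal lossless line with an open (high-impedance) far end and computes the source-end voltage before and after the first reflection returns. The function names and the 50 ohm line are illustrative assumptions.

```python
def source_end_steps(r_drive, z0=50.0, v_signal=1.0):
    """Source-end voltage before and after the first reflection returns.

    Assumes an ideal lossless line whose far end is an open circuit.
    """
    v_incident = v_signal * z0 / (z0 + r_drive)   # initial divider voltage
    v_refl_dst = v_incident                       # open end reflects fully
    gamma_src = (r_drive - z0) / (r_drive + z0)   # source reflection coefficient
    v_refl_src = gamma_src * v_refl_dst
    return v_incident, v_incident + v_refl_dst + v_refl_src


def trips_on_reflection(r_drive, z0=50.0, v_signal=1.0):
    """True if the pad crosses the mid-point only when the reflection returns."""
    first, second = source_end_steps(r_drive, z0, v_signal)
    return first < v_signal / 2 < second


first_step, second_step = source_end_steps(150.0)  # driver at three times Z0
```

With the driver at three times the line impedance the two steps come out to one quarter and five eighths of the signal swing. Under these ideal assumptions the same trip-on-reflection behavior persists for driver impedances up to a little over four times the line impedance.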

No tice that the delayth reocuegjhvet h e s i tnhpeu st arm e w hen seiatrohingfor the 
to the m idpointas w hen searching fortiliimm. i'Hi pi cs i fiht d od & lid jr tahi i d luagih s 
the receiver is cancedeck ionugta tUn os nd le 11 a tim es to determ ine w hen thi 
at the destination. Also note that this is a discretized time sample li
granularity.

We still have the problems that the delay between sample bits is pr
of the samples relative to actual signal transitions is uncertain. Thes
by allowing a version of the clock to be switched into the sample regi
the input receiver (See Figure 6.30). In this way, the inter-sample bit times can be
term s of fr actions ofthe com ponent’s clockt poenr ifo ri t Hfeb ©tillpthtjitapjic i 
and the enable pulse on the sample registers are synchronized to th
alignment of a signal transition at the far end of the transmission line
through the input receiver. Adding a delay margin for variation in the receiver
margin necessary for clean signal reception, the variable delay can be
always arrive at the far end of the transmission line at a well defined ti
other piece of information which we get from this configuration is th
requires for a bit to travel across a piece of interconnect. This informa
the routingco ma pooantfir the pipeline delaysiatfc isnogcbi httse ad c w oi $ h t r a n 
the associated interconnect (See Section 4.11.3).
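
The delay adjustment can be illustrated with a small calculation: halve the measured round-trip time to get the one-way transit time, then pick the output delay step that places the arrival at a chosen phase of the clock. This is a sketch under stated assumptions; the 10 ns clock, 0.2 ns delay step, 16 steps, and the target phase are hypothetical, as are the function names.

```python
def one_way_transit(round_trip_ns):
    """One-way transit time, assuming symmetric propagation."""
    return round_trip_ns / 2.0


def choose_output_delay(transit_ns, clock_ns, target_phase_ns, step_ns, n_steps):
    """Pick the delay step that puts the arrival phase closest to the target.

    The arrival phase is measured modulo the clock period, and the error
    is taken as the wrap-around distance to the target phase.
    """
    def phase_error(k):
        arrival = (k * step_ns + transit_ns) % clock_ns
        err = abs(arrival - target_phase_ns)
        return min(err, clock_ns - err)

    return min(range(n_steps), key=phase_error)


transit = one_way_transit(6.8)
step = choose_output_delay(transit, clock_ns=10.0, target_phase_ns=5.0,
                           step_ns=0.2, n_steps=16)
```

Choosing a target phase well away from the clock edge keeps the transition clear of the receiver's setup and hold window.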

6.9.4 Simulating Long Sample Registers 

The discussion in the previous section assumed the sample regis
actually record sam plesuntilsom e tim e C]fiQES26t Hee w d flet t ffa 0 fa aediisrn e d . 
0.8 p process,in verterdelays run about M)6 ]h ss taom2Q0 Jges b Tib iisse n a b 1 e d roup 
200 ps to 400 ps apart. Each nanosecond of wire, or about 15 cm of wire
sample bits in the sample register. Actually building a long sample reg

‘Actually, in an ideal settingthe im p ^3d-Ki V&JJSo Gsa4ifi4B’(g a s h i gh as 




the range of wire lengths of interest would be impractical. However, w
register by delaying the sample pulse into a short sample register by a
if we can delay the pulse into the sample register by multiples of the s
slide the sample register forward in the time sequence. By performin
with varying offsets for the sample register pulse and recording the sam
transition, we can virtually reconstruct the waveform which a long sa

Figure 6.31 shows a simple sample register architecture for simula
in this manner. The enable pulse ripples through the inverter chain a
pulse reaches the end of the inverter chain it is optionally recycled
pulse finishes cycling through the inverters the configured number o
will contain the values recorded during the last cycle. Care, of cours
the recycle path and in reconstructing the waveform. If the recycle pa
delay between the last sample bit in one cycle and the first sample bi
not be identical to the inter-sample bit delay for bits enabled during t
is small, it may only make the sample granularity slightly coarser. Figu
that recycles the sample bit before completing a cycle. As a result of th
all transition can be pin-pointed to inter-sample bit times. The wavef
the overlapping samples to more accurately mimic a single long sam

The maximum operating frequency of the counter and comparator w
of a sample register we can use in this scheme. The sample register
longto allow the com parator htri ad acnodu h £ ep rls) jgiacr hod sfe r th e nextenabl 
If we assume a delay of at least 100 ps through each inverter and we as
we need a sample register which is at least 25 inverter-pairs long for p
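
The windowing idea can be sketched as follows: repeat the same output transition several times, let the enable pulse recycle a different number of times on each pass, and concatenate the short captures. The sketch assumes the recycle path adds no extra delay and that the waveform repeats exactly from pass to pass. The second function shows the register-length arithmetic; the 5000 ps counter cycle used in the example is a hypothetical input chosen because it reproduces the 25 inverter-pair figure with 100 ps inverters.

```python
import math


def simulate_long_register(receiver_bit, n_bits, sample_delay_ps, n_cycles):
    """Stitch n_cycles short captures into one long sample sequence.

    receiver_bit: function mapping time (ps after the transition) to 0 or 1.
    Pass number `cycle` keeps the samples taken after `cycle` full trips
    of the enable pulse around the short register.
    """
    samples = []
    for cycle in range(n_cycles):
        offset = cycle * n_bits * sample_delay_ps
        samples.extend(receiver_bit(offset + i * sample_delay_ps)
                       for i in range(n_bits))
    return samples


def min_inverter_pairs(counter_cycle_ps, inverter_delay_ps):
    """Shortest register (in inverter pairs) spanning one counter cycle."""
    return math.ceil(counter_cycle_ps / (2 * inverter_delay_ps))


late_edge = lambda t_ps: 1 if t_ps >= 5000 else 0   # hypothetical reflection
long_view = simulate_long_register(late_edge, n_bits=8,
                                   sample_delay_ps=200, n_cycles=4)
pairs = min_inverter_pairs(5000, 100)
```

An 8-bit register on its own would cover only the first 1.6 ns here; the four stitched passes extend the window far enough to see the late edge.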

6.10 Summary 

In thischapterw e addressed the issue ofhigh osnpeenetd .sWgniad Id hn- g b e t v 
t i fie d the problem oftransm i 11 i n g Ip ict si b a tt w aese ai p o imt t ft g epo cni nttransm 
line signalling problem. We saw that a low-voltage swing, series-termi
nalling scheme provided the low-latency signalling we desired while
low. In order to address the issue of termination matching, we intro
end of the series terminated transmission line. This allowed us to
impedance to the characteristic line impedance, compensating for
vironment. Finally, we showed that the basic matching mechanisms
the delay alignment necessary to reliably pipeline data across wires

6.11 Areas to Explore 

In this chapter, we have detailed a matching scheme that uses digit
m atched im pedance. It w ould also be p oescsd bdeer tot an dee he ci tl tw phle tb a a 
the impedance selection was high or low. Such a scheme may requir
more amenable to automatic, on-chip calibration.


The extensions necessary to allow delay matching have not, as yet
tested. No doubt, we stand to learn more about this problem from su



7. Packaging Technology 


Wh en w e actuallybuild anysystem , w e m ustph yq^i o a lei ly p sa. cWe a ge its p 
must provide a physical medium for the wires interconnecting com
a mechanical support substrate to organize and house the compon
packaging a network will directly affect the size of the packaged net
connection distances and transit latency between components. In
packaging technologies and develop hierarchical schemes for pack
in Chapter 3. The packaging scheme developed here makes use of all
minimize interconnection distances.

7.1 Packaging Requirements 

When packaging a network we have many, often conflicting, goals. We

• Minimize the interconnect distances (and hence
• Provide controlled-impedance signal paths (Chapter 6)
• Supply adequate power to all components
• Facilitate synchronous clock distribution to all components
• Cool components by removing the heat generated by ICs during ope
• Facilitate physical repair
• Minimize packaging cost

To keep the interconnect distances short, we seek dense packaging s
as close as possible. Excessive density, however, makes supplying p
physically repairing faults difficult. High-performance packaging and i
the most expensive part of a system to manufacture and assemble.
strategy, we must keep in mind the economics of the available techn

7.2 Packaging and Interconnect Technology Review
7.2.1 Integrated Circuit Packaging 

Conventionalpackagingtechnolo gjls <t arrrt si iwt et£r sptae dkcai ga; di Bt s as the b 
level b uildingblock. Silicon ICs are diced fro me fehde ifia HGpcaactli (» ge w . a fe r 
Fine pitch wires are bonded from pads around the periphery of the die
shelves along the perimeter of the die cavity in the IC package. The pac





and bondingw ires from the environm ent. Dicingand packagingallo\ 
com ponents b lji fiy n n tii cs p a e d . Th e packaged ICcan be m ore easilyhan 
and assem blythan the bare die.Packaged ICs can be replaced as d e fe 

Today, most IC packages are plastic encapsulants, ceramic, or fine-l
Plastic packages are inexpensive, but only allow connections aroun
limited capacity for heat removal. Ceramic packages are much mo
hermetic sealing for the die cavity and allow greater heat transfer fr
package. High power IC packages provide paths of high thermal condu
package where a heat sink may be mounted to disperse the heat re
circuitboard ICpackages uhthld deo tgpi asssiarmi « etan PCB production and can 
experience,tools,and m anu fa cturingdeve loped for fin e-line PCBs . Ce r a 
packages can supporti/o connections coen. Tih bssgi ®efl tripsp^oilihge surf 
p i n -gr i d array (PGA) arran gem ent. 

Whether the pins are arranged around the periphery of the IC for a p
in a gridded fashion, the package size is generally determined by the
size and achievable density of external i/o connections. As a result, a
than the housed die. The size of an IC package may only be correlated t
because both the package size and the die size are often directly dete
pins on the component.

7.2.2 Printed-Circuit Boards 

Packaged ICs are assembled on printed-circuit boards (PCBs). These b
support for the ICs and provide the first level of interconnection amo
Conventionalprinted-circuitboardtechnologya PCBsw Miktti p he a n u fa c t u 
layers ofetched copperprovide interconnect in tw o-dim ensionalpl 
are separated b yla ye rs o fin s u la tin gm aterials.Drilledoaimcbptlated hoi 
am ongthe tw o-dim ensional etched co pi pi 4a )®@J^.eBasc & a gh din ICs s i n gl e m 
m aybe located on one orboth sides ofa com posite PCB. 

Some packaged ICs have pins which can be inserted in mating hol
via a socket. During assembly, solder is used to connect the compone
Com ponentpackages w ith pi Ihesdr a fjh ti hr as hvoaly fihd'D u gh the PCB and are 1 
termed through-hole components. Another common form of packaged ICs ha
which can be soldered to exposed metal lands on a surface of the P
w hich siton the surface ofthe PCB and solder to PCB lEGB,daa‘ ®v ithoutprc 
cal \ surface-mount com ponem Cse. dim mftcom ponents onlyconsum e space on 
ofthe PCB connected to the IC, w hereas thro ugh -hole com ponents con 
t h e PCB include the oppaxsang PCB surf 

Acorn m on discipline w hen d e s ign in gp rin te d -c ire u it b o a rd s fo r d i 
dedicate apow e r plane fo reach pow e re.1ge VSrlorie ia}iili,r,d ti,., i,n). the system ( 
Th is discipline allow spow er distribution w ith m in im alresistive 1 o 
and the com ponents. Dedicated pow er planes also provide low in d u 
supplies and the packaged ICs ’pow er leads. So lid conductor plane 
reduce the cross-talk betw een s PgtBnlbt ga a e a nota ethceosiasmsfcent control’ 




im pedance interconnect,si gn alt racoasn ad tie cgeonr epr laalijer o n boevfc w ae c n a p 
ofsolid conducting {ftTJ84])e s (Se e [R 

As a practical m after, there is a lim itto the system size w e can plac 
board. De spite the fact that PCBs are ge nltf irpa M ycd ai in tp <d 4 a $le o sf,nPQB 
technologyis essentiallytw o-dim ensional. The size ofa single print* 
bym echanical stability, yi eld, and m a n u fa c tu rin g c o n s tr a in ts . To da 
m a xim um vi a b 1 e side 1 e n gt h fo r PCB t e c h n o 1 o gy. In fa c t, fo r m a n u fa c tu i 
m anufacturerslim i t o n e PCB side dim ension to less than 14 inches. For 
yield considerations w illgenerallyprovide m ore severe lim its on the 

To d a y, PCB features dow n to 8m ils (1000 m ils = 1 in c h ) a re considerei 
m anufacturers can produce features dow n to 3 o r 4 m ils ,b u t th e over; 
m anufacturingcosts are roughlyproportional to the BGBm her oflaye 
Be low the feature sizes used in volum e production,the costincrease 
Wh en dealingw ith m uht hl<a tye giRCB.teecc stis also dependenton the variety 
holes required. 

7.2.3 Multiple PCB Systems 

Wh en a system design exceeds the size w hich can be e ffic ientlypla 
circuit board, the system m ust IPCBb u ri It to fiiam hic te ob Miap ieonnectors an< 
cables. Bo ards interconnected via a backplane PCB is,by fa r,the do m in 
interconnect,today. In this case,one PCB is used to interconnect m an 
on the “backplane’’board allow other boards to m ate ortho go nally 
produces a structure w hich takes som e advantage ofthree-dim ensic 

Wh en a system exceeds the size practical to build in backplane fa 
p h ysic a l,o r m e c h a n ic a 1 re qu ire m entslim itbackplane use,portions 
vi a cabling. Aga in ,c o nn ec tors o n each printed-circuitdbroraardtprovide a ] 
Ratherthan directlyattachingtw o orm ore boards,a cable o fin su late 
connectthe boards. 

Cables for controlled -im pedance interconnects com e in three pri 

1. Ribbon cables

2. Coaxial cables

3. Flexible printed-circuit cables

Ribbon cables are com posed ofa e a quhe siecpe acr fi 4 © ch diiyaho ims su 1 a t i n gm 
rial. FI at-ribbon cables can be used in a m aancnc as ip \ka lb ilce he ge m ta - cal 1 bydp-r < 
im pedance interconnect. Coaxialcables place a conductor inside 
cables have m ore stable im pedance characteristics,b u tare often bi 
the alter natives. Flexible printed -circuits use the w ell established F 
lam inates. Typ ically, flexible printed -circuit cables achieve controlle* 
o ve r gr o u n d plane tidfipa ® E£l|y fa m 




7.2.4 Connectors 

To d a t e , m ostboard -to -board, board -to -cable, and board -to-packag 
iisingpin-and-socketconnectors. On e board,cable,or IC has a connec 
row s ofpins.The m atingunithas a connectorw hich houses a setof 
ge o m e t r y. Th e t w o p nenceecs t a rie bey m atingthe pin and socketconnection 

Mo re recently, a n u m bero henoem tporres sbs ieven hcecorn e available. On e c o r 
connectors can m ate directlyw ith lands on tw o PCBs or packages. W 
nectorprovides electrical interconnect betw een the lands. Co m pre 
characteristics w hich m ake them preferable to pin-and-socketcon 

1. Higher density

2. Superior signal integrity

3. Lower insertion force

4. Function without solder

Pin -a n d -so c k e t c o n n e c to rs are lim ited bythe achievable density ford 
com pressionalconnectors are lim ited bythe area required to carrysi 
ofseparate conductors. Rem ovingthe need for solder m akes assern 
insertion force required forinsertingpin -and -socket connectors is p 
pinson the com ponent.As the num berofi/opinsincr eZOOs- e ,so d o e s th 
p i n PGAs are a Ire a d y e xp e rie n c in g e xc e s s ive dn si eerct to n rfbamai fafo trie n a igsc 
to m ove to m ore com plicated sockettingschem es w hich m ate pins 
outthe required insertion fo rc e . Tra d itio nalpin-and-socketconnecti 
controlled -im pedance paths, w hile m anyofthe ern ergingcom press 
s i gn a 1 p a t h s . 

Severaltechnologies currentlyavailable forcom pressionalinterco 

• Anisotropic conductive elastomer

• “Button balls”

• Springs

Severalm anufacturers now produce strips or sheets ofelastom erw 
conductors are arranged to conduct onlyalongone axis. In this w ay, 
betw een co nadem cdbopnfs <p d i t e sides ofthe connector and lined up alongt 
The elastom erw ill com press underpressure all bvw e d eg d lb rei oc aol n d u c t 
c o n t a?.g. t [$n c 90] [Po 1 90] [Te c 88] [ND90]). “Bu 11 o n b a 11s ”a rqis p mnpgo id dv oi f 25 
com pressed into srn alldiam etg.t e2© m yilli d darmc a t dirobl y 40 ifn ilhigh)in a plas 
carrier. Theyprovide m ultiple points o f c o n taa t b esW s eit hhee bPCBll a n < 
w hen com p .g.e [49s ^88] [Srn o 85]). Sp r i n ges tiyAee c to rs houEcshffljfmdepail 
w hich behave as springs in a flexible carrier. Wh en com pressed betw ■ 
ofthe m etalforces positi vPCBsi q Pa te <b Ph i i hdtdrosncnfet hc^o afGPM92] 

[Co r 90]). 




[Figure labels: bolt, printed-circuit board, compressible connector, DSPGA packaged component, nut, rigid plate]


Shown here is a cross-sectional view ofa pac IPOI^ mg' stack. Com ponen 
sandw iched in a Item a tin gla ye rs to form a three -dim ensionalstack str 
This stack structure serves as the next level ofthe packaginghierarchy 
form s the basic buildingblock outofw hich largersystem s can be buil 


Figure 7.1: Stack Structure for Three-dimensional Packaging

7.3 Stack Packaging Strategy 

Leveragingm ostlyconventionalpackagingtechnologies.w e can p a c I 
utilize a 11 th re e spatialdim ension. We continue to use fairlyconventi 
technologybut stack c PGffispiai ntdi te t d ia nn dnsion ortho go nal to the PCB p 
fo r m stack structure sandw ichinglayers ofpackaged ICs b e t w e e n PCB 1 a yi 
Com pressionalboard-to-package connectors provide signalcontinu 
printed -circuit boards and integrated -circuit com ponents. This sta 
packed three-dim ensional cube ofcom ponents and interconnect 
buildingblock foreven largernetw orksand system s. 

7.3.1 Dual-Sided Pad-Grid Arrays 

Wh ile there is noveltyin ourdesign and use ofthe ICpackage,the bas 
w e ern ployis conventional. The integrated -circuit is housed in a pa 
ofcontacts. Rather than b e in g p in s , th e contacts are land grids sim : 
for attaching sou it fia tc ce o-mr ponents. These land grids are connected thr< 
connectors to sim ilar land grids on the PCBs . Du e to the low -insertic 
a va i 1 a b 1 e fr o m these 1 a n d -gr i d a r r ae ys e(LGAy h the eoynh itvanr attractive optio 
packaginghigh pin e-.<g.o |Bn C 89$). (In t e 1, fo r e xa mdpolp t h (& st hae LGA p a c k a ge 
fo r its 80386SL m icroprocessor [Ma 191]. 

We m ake one fin al addition to the LGA structure. Rather than placin 





the bottom side ofthe p a c k a ge , w e place the land -gr id a r rayon both 
optionally, pro vide continuitythrough the package betw een the top a 
resultingst rdnaiJ-\siiladq?ced-grid array (DSPGA) to em phasize the factahadpads are p 
on both sides ofthe package.Each verticalpad paircan be con figu red 

1. The correspondingpads on the top and the bottom ofthe package 
thro ii gh w ithoutconnectingto an ICpin to support verticalinterc 

2. The correspondingpads can be connected togetherand to an ICp 
m ake contactto traces on eitherorboth ofthe boards above and 

3. The correspondingpads can be connected to di ffe rent IC pins to s 

No tconnectingthe correspondingtop and bottom pads as in (3) r e qu i 
u fa c t u r e and w illrn ake the package m ore expensive than ifonlycoi 
used. 

Fi gu re 7.2 shows DSPGA372, a 372 pad DSPGA w e h a ve d e veul p p e d tsDSPGA372 s 
160 IC si gnal connections, 76 through signals which do not connect to 
supplies supported by 72, 40, and 24pads,respectively. Alllands are 30 m i 
around 10 m il plated holes. Contacts aridigyiobp B£E@A137B> tr disi gh n o b 
three internalpow erplanes forprovidinga low -resistance and low -i 
and the externalpow er supplies. The nom inalground plane is suppo 
planes supplythe logi d’/p ; ,p avn edr tshuepspi a 1 1 iunpgp fciyjy, „ p Ad d i t i o n a 11 y, 
space is provided in the p a <d lu aigfe bfeyp - as susr faacpe a-m itors across the pow e 
The 76 t h r o u gh pins allow the package to supplytiicre fiOgh faute re o n n e c t 
signals which do not connect to the IC. The rem aining 160 pads suppo 
Each ofthese 160 signals is available on both the top and the bottom o 
Table 7.1 s u m m arizes the physical dim ensions ofour DSPGA372 p a c k a g 
pictures ofthe package. DSPGA372 w as fabricated bylbiden using BT (Bi s i 
as a seven -la ye r printed -circuitboard. 
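
As a quick consistency check, the pad allocation quoted above can be tallied in a few lines. This is only an illustrative sketch: the pad counts (160, 76, 72, 40, 24) come from the text, while the dictionary keys, the supply labels, and the function names are hypothetical.

```python
# Pad allocation for DSPGA372 as quoted in the text.
# The key names, including the three supply labels, are placeholders.
PAD_ALLOCATION = {
    "ic_signal": 160,  # pads wired to IC signal pins
    "through": 76,     # straight-through pads that do not connect to the IC
    "supply_a": 72,    # three power supplies, listed in the order given
    "supply_b": 40,
    "supply_c": 24,
}


def total_pads(allocation):
    """Total number of pads across all categories."""
    return sum(allocation.values())


def pad_share(allocation, key):
    """Fraction of the package's pads devoted to one category."""
    return allocation[key] / total_pads(allocation)


if __name__ == "__main__":
    print("total pads:", total_pads(PAD_ALLOCATION))
    for name in PAD_ALLOCATION:
        print(f"{name}: {pad_share(PAD_ALLOCATION, name):.1%}")
```

The five categories account for all 372 pads, with a little under half of them carrying IC signals.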

Th e DSPGA372 package has dedicated coolingand alignrn entholes.The 
be used to align the package to the com pressionalconnectorand a 
The inner holes open into the heatsink cavity underneath the die all 
flow across the heat-sink forheatrem oval. 

7.3.2 Compressional Board-to-Package Connectors

Pa c k a ge d ICs are m a t e d PCBist,hb ® d Ija a b n> tve and below .through com pr 
connectors. These connectors provide through contact betw een tl 
PCB. Us in gself-align in geo no p neescstiooms ado solder is needed to m ake reli; 
Properlyselected com pressionalconnectors w illprovide consisten 
asrequired fo r h i gh -s p e e d s i gn a 11 i n g. 

Fi gu re 7.4 show s a picture of BB372, a com pro snsni e q taol r bdm tst bgn ebdo taor d c 
m ate w i t h DSPGA372. lBB3G2s e s 372 buttons aligned w i tffeSRQL^7Ea n d gr i d s on 
The buttons used bfi B>Bu>'£2eadr &40 dflilcylinders. Fi gu r e 7.5 show s a close 




[Figure 7.2 drawing labels: 372 contact chip carrier; 1.400; package; Coolant Channels; Alignment Holes; Heat Sink]


DSPGA372 i s a 372 pad dual-sided pad-grid a nrra §tcAld i[b a drsa aghetct h r o u gh 

betw een the top and bottom ofthe package.76pads do notconnectto 

existsim plyto provide through interconnect. (Ar tw ork byFred Drenckht 


Figure 7.2: DSPGA372


picture ofa button ino Irhnee <BB37£. cThe buttons provide low -resistance 
irn pedance interconnect. Th eocnemet e Icorfi b e jBB372too accorn m odate the 
or lid associated w ith the m ating DSPGA372. BB372 is 30 m ils thick allow 
or lid attached to DSPGA372 to extend atm o s t 30 m ils verticallyabove or 
Com plern entaryholes are provided for coolant flow to m atch those c 
has tw o stubs at opposite corn eDSP(3AK72call i gn art c iwt ill b ltdrse. Th e stubs 
protrude on both sides ofthe carrier, allow ingthe carrier to nr ake 
the attached PCB and DSPGA p a c k a ge . Th e BB372 carrier is nr a d e fr o nr Ve c 
Po 1 ym e r [Co r 89] and w as fabricated byCinch.Table 7.2 sum nr a r i ze s the p 
Ou r nr a i n d i s a p p o iBB85£ h a & \b a fehn the handlingcare required. The f 










Table 7.1: DSPGA372 Physical Dimensions
Top (die cavity) Bottom (heatsink)



Pictures of DSPGA372 shown actual size

Figure 7.3: DSPGA372 Photos

com posingthe buttons can easilybe pulled outofthe cylindricalhol 
the buttons,the w ires often attach to the ridges on the person ’s finger 
the fingers m ove aw ayfrorn the connector. As a result,the connector w 
i m properly. Wi th proper equipm ent,the buttons can be restuffed,sc 
inserted into a system and com pressed,the buttons re m ain situated 
Initialexperim eairtsdwdt hva elastom er fij2>2h$ uFg|g<p p tothy^Rue lastom eric 
technologyis a viable alter native fo r this application. El astom eric co 
hum an handling. The elastom erprovides a sheetofanisotropic cont: 
isrequired to m atch the pad georn etryofthe package or PCB. Si ze and si 
custom ization required fo reach application. As a result NRE costs are 

















































Picture of BB372 shown actual size


Figure 7.4: BB372


Feature                          Size
Connector Outline                1.4" x 1.4"
Connector Height                 30 mil
Center Opening                   [illegible]
Button Diameter                  20 mil
Button Spacing                   50 mil
Cooling Hole (diameter)          100 mil
Alignment Stub (top height)      28 mil
Alignment Stub (bottom height)   48 mil
Alignment Stub (diameter)        78 mil

Table 7.2: BB372 Physical Dimensions


us slightlym ore freedom to choose the connector height. As a side b 
to gasketthe coolant fo reed through the coolingholes. On the n e ga t i \ 
low er current capacityand higherresistance than the button -board 

7.3.3 Printed Circuit Boards 

Printed-circuitboardsaresandw iched betw een com ponentlayers 
Th e s e PCBs are fairlycon lvtei hat ye> m, a b nm t ur o lie cPGBn . p'Hidea dan by special 
accom m odations required are the land grids and alignm entand co 
w i t h the DSPGA packages.The PCB lands m im ic the geo m etryofthe DSPG/ 
nobilitycontact. tdicee BGB is hdr ibe go Id plated. Alignm entholes allow cor 
a s t h e BB372, to align w ith the PCB lan d p attern . Co o lihtgh© heosoalnenle qu ir 
flow through the coolantholesprovided in the connector and ICpack 


















V- 



Closeup ofbutton on BB372 







In side the stack,each printed -circuit boa net araa hi tpeas we irt It st iwt eo, c o m 
one above the PCB and one below it. Mo stofthe pins on the com pon 
PCB should notbe connected to the co nrceesip tqEnockiienngtpd mstba ctlpepauijte 
side o f t h e PCB. Th e PCB m ust not provide contiaa<i:th/bDd : it\seseiirflhcespads 
w hich occupythe sam e planar location. Th is re quirem entcan be sai 
m anufacturingtechnologybyeither usingvias w hich onlyconnect in 
routinglayerson the PCfii, n> g tth jeo vfifa s a s s o ecai a he let w di tsho theydo notinterse 
In som ecase,continuitybetw een correspondingpinson the com pon 
is desirable. Pow er si gnals,busses,and globalsignallines are com m < 

7.3.4 Assembly 

Astack is assern bled bybuildinqgnpdayoirs oafrPQBp ,acc k a ge d ICs . Fi gu r e 
depicts the com position ofa typical stack. Figure 7.6 show s a m ore c 
com ponent stack. Figu re 7.7 show s a close-up cross-section ofan ass 
placed bol hsot ih tr bhueg stack provide verticalcom pressive force and prc 
board alignment. Arigid metal plate at the top and the bottom ofth 
com pressive force across the stack. The alignrn entpins and holes p r > 
carriers,and packaged com ponents. Th eBEGPBgja IrhoeBnatcjhi ib su ptnonvi d e d i i 
board to align to tB&PGAiajai ode BCtB independently. Ap a re e a rt 4 tr, c ® lmgn 
at everystage. Alignrn ent tolerances from layer toei ac^B372 irre n o t a d d i t i 
30 m ils thick and the m ati ©§IJG(A3T2i ios n8Gbnf eiis:trhick,the space betw een P 
in an assern bled stack usingthese com ponents is 140 m ils. 

7.3.5 Cooling 

On ce stacked,the coolantholesin the DSPGA packages,carriers,and 
vertical coolingchannels through the stack struct uB&PGATh e heat sin 
com poneaaclasrfidg the fo u r c o o lin gch a n n e lp 8 snseoncti. allheed cwh atih ni fesl s: o m 
allow forced airorliquid coolantto be circulated through the stack a 
proper heat sink design,the coolant flow ingacross a com ponent bf 
e xp e r i e n c e turbulent flu id flo w to e ffe c t e ffic ient heat tran sfer from t h 
All coolant channels can be 1 e ft open allow ingparallel flow across 
colurn n. Alternately, everyother coolant hole can be p lu gge d fo r c i n g 
sinks in a coolantcolum n. 

7.3.6 Repair 

Co m pon eanctermpeln tis sim plified biy three s 4 ileal re s 1 eTa sreplace a faultyco 
pon eHCB, orocnnector,w e sim plyneed to disassenkb <bewt h eges <6 adc k , s u b 
replace nr entforthe faulty unit,and re-assen®bheeChtEosnta(Dlb .vTh fee stdsr led e 
need to desolder com ponen tfK]Bsn. dDfrcecwu ® s k , fmoewl ienrem nunset <b tes d i s c 
from the stack and coolantdrained before the stack can be disassen 





Shown above is an enlarged cross-section of a network component [...]
(Diagram courtesy of Fred Drenckhahn)


Figure 7.6: Cross-section of Routing Stack


7.3.7 Clocking

Any synchronous system requires that clocks be distributed to all [...] the components see the clock edges at approximately the same time [...] arrival times is known as skew and, generally, acts to limit the clock rate by [...] setup and hold times. The clock distribution problem in the stack is [...] problem of clock distribution on any large PCB or multi-PCB system. [...] distribution trees and low-skew clock buffers can help minimize this [...]












Show n here is a close-up picture ofa m ated sepo fiftBSl^a n d DSPGA372 c o : 
w hich has been cut at an oblique angle to expose the topologyofthe m 
The stack show n above is com posed ofa BB372, DSPGA372, BB372, and DSPGA37 
sandw iched inside a plastic encapsulant. The encapsulant serves to 
togetherafterthe cross -sectional cut w as m ade. 


Figure 7.7: Close-up Cross-section of Mated BB372 and DSPGA372 Components


For short stacks in w hich the propagation delay verticallythro ugh 
ponentsis sm all, it m aybe su ffic i e n t t o c a eeafo 11 1 y d> I si tmi h id tie ©hnee cl hqye k 
o f PCB (Se e Fi gu r e 7.8) d) In enne ct the clock signals verticallythrough each c< 
stacks,the propagation delaythrough the col urn n m aybe intolerable 
through the vertical interconnects m aynotbe su ffic ientlycont rolled 
tion.Alternately, w e can use a tw o-tierfanoutschem e.Each PCB 1 a ye r s 
identicalclock fan o u t,sim ilarto the single layerfanout.The input to 
from another fa no u t tree through carefullytuned lengthsofcontrolle 
flexible printed -circuitcables.The additionalstage offanoutadds son 
An other option is to provide a directconnection to each clocked I( 
the edge arrivaltim e is care fu lljli ttant e\t& l^itnh i92]k Qn d ofclock distribut 
sim ilarto the m atched delaydrivers described in Section 6.9. Ho w e ve 
clock skew makeitamuchmoredi ffic ultproblem . 

7.3.8 Stack Packaging of Non-DSPGA Components 

As described so fa r,the packagingschem e re qu ires all ICs be p a c k a 
The netw orks described in this doc um entare builtoutofhom ogen 












[Figure labels: Clocked IC, Buffer, Primary Clock]

Shown above is a representative clock fanout scheme. The trace lengths in all clock runs should be balanced so as to guarantee as little skew between clock edges as possible.

Figure 7.8: Sample Clock Fanout on Horizontal PCB

long as we can package our routing component in DSPGA packages, the entire network can be easily constructed as described. It is, nonetheless, worthwhile to consider how to accommodate other components in the stack. The network endpoints, for example, constitute components other than routers, and we may not have the freedom to package all such components in DSPGA packages.

The stack structure will readily accommodate low-profile components in the spaces between DSPGA components. As noted, using DSPGA372 and BB372 components there is 140 mils of clearance between PCB layers. ICs which fit comfortably within this height can be accommodated in the stack. The height requirement precludes almost all through-hole components including PGAs. Through-hole components further complicate the matter since their pins generally extend through the PCB and into the space below the attached PCB. Most leadless chip carriers (LCC) and gull-wing surface-mount components are around 100 mils thick and can easily be accommodated. J-lead surface-mount components are generally thicker and leave insufficient clearance. Smaller surface-mount components such as TSOPs are easily accommodated and may be thin enough to allow components to be mounted on both sides of adjacent boards. Of course, the non-DSPGA components only have direct access to signals on the PCB to which they are mounted. Such components must make use of the spare, through connections provided by DSPGA packages when they require vertical interconnect. The non-DSPGA packages are not part of the assembled stack cooling channels. Cooling for these components is limited to horizontal forced-air between PCB layers.

7.4 Network Packaging Example 

For the sake of concreteness, let us consider how we package a small multistage, multipath network. Figure 7.9 depicts a mapping of a multistage network into a stack package. Each stage of routers is assigned its own plane in the stack. The routers are distributed evenly in both dimensions within the plane. The PCB between planes of routing components implements the interconnect between adjacent stages of routing components. Since there are $\Theta(N)$ routers in each stage, distributed in two dimensions, each side is $\Theta(\sqrt{N})$ long, making the wire lengths between stages $\Theta(\sqrt{N})$ long. The transit latency growth for this structure will thus match our expectations from Section 3.1.5. If the inputs and outputs are not all segregated to opposite sides of the network as shown in the figure, it will be necessary to run the input and output connections which originate on the wrong side of the packaged network vertically through the network layers to connect the inputs or outputs into the network. These loop-through connections are one class of signals which use the straight-through interconnect provided by the DSPGA packages.
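
The square-root growth in board side length can be illustrated with a short sketch. It assumes, purely for illustration, that the routers of one stage are tiled on a square grid at a fixed pitch; the function names and the pitch value are hypothetical and are not taken from the report.

```python
import math


def grid_side(routers_per_stage):
    """Routers along one side of the smallest square grid holding one stage."""
    if routers_per_stage < 1:
        raise ValueError("need at least one router")
    # Integer form of ceil(sqrt(n)), avoiding floating-point rounding.
    return math.isqrt(routers_per_stage - 1) + 1


def board_side(routers_per_stage, pitch):
    """Board side length in the units of `pitch` (an assumed router pitch)."""
    return grid_side(routers_per_stage) * pitch


if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        print(n, grid_side(n), board_side(n, pitch=1.0))
```

Under this tiling assumption, quadrupling the number of routers per stage doubles the board side, and with it the longest wire run between adjacent stages.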

7.5 Packaging Large Systems 
7.5.1 Single Stack Limitations 

Unfortunately, there is a limit to the size of our stacks and hence the size of network which we can build in a single stack package. Recall from Section 7.2.2, our PCB size is limited somewhere under 30 inches for the fine-line technology we use. Vertical layers are relatively thin. Consequently, if we package the layers as suggested in the previous section, we normally do not run into any physical constraints in the vertical packaging dimensions. For example, a typical PCB thickness for the horizontal PCB would be 100 mils. Section 7.3.4 noted that using DSPGA372 and BB372 components, the space between PCBs is 140 mils. In this scenario, each additional network layer will increase the stack height by just under 0.25 inches. Since the PCB side size is increasing as $\Theta(\sqrt{N})$ and the number of stages, and hence height, is increasing as $\Theta(\log(N))$, we encounter the PCB size limitations before any vertical constraints.
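
The height arithmetic above can be checked with a minimal sketch. The 100 mil PCB thickness and the 140 mil PCB-to-PCB gap are the figures quoted in the text; the function names, and the simplifying assumption that an N-endpoint network needs about log_r(N) routing stages with one PCB plus one gap per stage, are illustrative only.

```python
MILS_PER_INCH = 1000      # 1000 mils = 1 inch
PCB_THICKNESS_MILS = 100  # typical horizontal PCB thickness quoted in the text
PCB_GAP_MILS = 140        # space between PCBs with DSPGA372 and BB372 parts


def layer_pitch_inches():
    """Height added by one network layer: one PCB plus one component gap."""
    return (PCB_THICKNESS_MILS + PCB_GAP_MILS) / MILS_PER_INCH


def stages_needed(endpoints, radix):
    """Smallest stage count s with radix**s >= endpoints (assumed stage model)."""
    stages, reach = 0, 1
    while reach < endpoints:
        reach *= radix
        stages += 1
    return stages


def stack_height_inches(endpoints, radix):
    """Approximate height of the routing layers for an N-endpoint network."""
    return stages_needed(endpoints, radix) * layer_pitch_inches()


if __name__ == "__main__":
    print("layer pitch (in):", layer_pitch_inches())
    for n in (64, 4096, 262144):
        print(n, stages_needed(n, 4), round(stack_height_inches(n, 4), 2))
```

Each layer adds 240 mils, just under the 0.25 inches stated, and the height grows only by one layer each time the endpoint count is multiplied by the radix, which is why the board side limit is reached first.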

Nonetheless, the vertical constraints that may arise are mostly dominated by cooling and signal integrity constraints. As the number of components in a vertical column increases, in a parallel cooling scheme we will require greater pressure and flow-rates to cool the components. Similarly, in a serial cooling scheme, the temperature gradient between inlet and outlet will increase. As noted in Section 7.3.7, for sufficiently tall stacks we cannot rely on vertical column interconnect for high-speed, low-skew, global signal distribution.





[Figure 7.9 panels. Logical Diagram of Simplified Network: From Processing Nodes, Network Inputs, Network Outputs, To Processing Nodes. Physical Network Construction: From Processing Nodes, printed circuit board, routing component, Routing Stack, To Processing Nodes. Networks not drawn to scale. Note: routing components are distributed evenly in both dimensions across each plane.]

The diagram above depicts how a logical stack is mapped into the stack structure. The interconnect between each pair of routing stages is implemented as a PCB in the stack. Each stage of routing components becomes a layer of routing components packaged in DSPGA packages.


Figure 7.9: Mapping of Network Logical Structure onto Physical Stack Packaging

7.5.2 Large-Scale Packaging Goals 

To build large networks, we seek two things:

1. A network stack primitive which represents a logical portion of the network and can be replicated to realize the connectivity associated with the target network

2. A topology for packaging and interconnecting these primitives

As developed in Chapter 3, for large machines we focus on fat-tree networks. Our problem is finding a decomposition of the fat tree into representative sub-networks which can be implemented in a single stack structure. We desire homogeneity in stack primitives for the same reasons we do in integrated-circuit components (See Section 2.7.2). When selecting a packaging infrastructure for assembling the primitive stacks, we must address the same general packaging issues raised in Section 7.1.










7.5.3 Fat-tree Building Blocks 


Recall from Section 3.5.6 that we can think of a fat-tree, multistage network as composed of three parts:

1. a down network which recursively sorts connections as they head from the root toward the leaves

2. an up network to route connections upward toward the root

3. lateral crossovers which allow a connection to change from the up network to the down network when it has reached the least common ancestor of the source and destination nodes


Using radix-$r$, dilation-$d$ crossbar routers, we build any down tree sorting network much like a flat, multistage network. Routers in the upward path allow a connection to connect into one of the next $(r - 1)$ downward routing stages or continue routing upward. The upward routers compose both the up network and the crossover connections. For every $(r - 1)$ logical tree levels, we have one upward routing stage.

We can collect the $(r - 1)$ downward routing stages, the associated upward routing stage, and the crossover connections into a physical tree level. Each such physical tree level encompasses $(r - 1)$ levels of the original tree. Taking our $(r - 1)$ down tree stages, the total sorting performed by a physical tree level, $r_p$, is as given in Equation 7.1.

$r_p = r^{(r-1)}$ (7.1)

The size of the logical node at each physical tree level will increase as we head towards the root since the bandwidth at each tree stage increases toward the root. As a result, we need to further decompose each physical tree level into primitive units which can be assembled to service the varying bandwidth requirements at each tree stage.

We use the term unit tree to refer to any primitive stack structure which implements a bandwidth slice of each physical tree level. There is a large class of unit trees based on the parameters of the router and packaging technology. Table 7.3 summarizes the parameters associated with a unit tree stack. The router parameters $r$, $d$, and $w$ have been discussed in detail in Chapters 3 and 4.

At the “bottom” of the unit tree, the number of channels headed to and from physical tree levels closer to the leaves is denoted $c_l$. $c_l$ is a multiple of the router dilation, $d$, which determines the size of the bandwidth slice handled by the unit tree. With $r$ determined, $c_l$ can be chosen such that the size of the unit tree is accommodated in a single packaging stack. One generally wants $c_l$ large for increased fault tolerance and resource sharing. The available packaging technology will limit the size of any stacks and, hence, limit $c_l$. In these respects $c_l$ is much like the router dilation, $d$; we generally select as large a value as we can afford given our physical and packaging limits.

At the leaves of a tree, we need to connect the processors into the tree, and hence we need a unit tree with channel capacities matched to the input and output channel capacity of each node (i.e., $c_l = n_i = n_o$). The channel capacity in and out of the top of a unit tree, $c_r$, is fully determined once $c_l$ and $r$ are chosen and is given by Equation 7.2.

$c_r = c_l \cdot r^{(r-2)}$ (7.2)






$r$      router radix
$d$      router dilation
$w$      router width
$c_l$    channels per logical direction toward leaves
         (depends on packaging technology and system requirements)
$r_p$    logical tree levels per physical tree level
         (determined by router radix)
$c_r$    channels out of unit tree toward root
         (determined by $c_l$ and $r$)


Table 7.3: Unit Tree Parameters



Routing Components            $UT_{64 \times 2}$   $UT_{64 \times 8}$
Up Routing Stage              16                   64
Final Down Routing Stage      16                   64
Middle Down Routing Stage     12                   48
Initial Down Routing Stage    8                    32


Table 7.4: Unit Tree Component Summary

7.5.4 Unit Tree Examples 

For the sake of illustration, let us consider two specific unit tree configurations introduced in [DeH91] and [DeH90]. Here, we denote each unit tree as $UT_{r_p \times c_l}$. Both of these unit trees use the RN1 routing component, a radix-4, dilation-2 routing component. $UT_{64 \times 2}$ has $c_l = 2$ and, consequently, $c_r = 32$. $UT_{64 \times 8}$ has $c_l = 8$ and $c_r = 128$. Both have $r_p = 64$. Table 7.4 summarizes the number of components composing each stage of the network in each unit tree. If each routing component is housed in a DSPGA372 package measuring 1.4 inches along each side, and we leave as much space between routing components as each component occupies, the $UT_{64 \times 2}$ stack measures just under 1 foot along each side, and the $UT_{64 \times 8}$ measures just under 2 feet. Assuming BB372 connectors, the four layers of routing components in both unit trees are 1 inch tall. With compressional plates, each stack is under 2 inches tall.
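
The footprint figures above can be reproduced with a short sketch. The per-stage component counts are those of Table 7.4 and the 1.4 inch package side is from the text; the square-grid tiling, the pitch of twice the package side, and the 0.24 inch layer pitch (100 mil PCB plus 140 mil gap, from Section 7.5.1) are assumptions made for this illustration, as are all names.

```python
import math

PACKAGE_SIDE_IN = 1.4  # DSPGA372 side length quoted in the text
LAYER_PITCH_IN = 0.24  # assumed: 100 mil PCB + 140 mil gap per routing layer

# Routing components per stage, taken from Table 7.4.
UNIT_TREE_STAGES = {
    "UT64x2": {"up": 16, "final_down": 16, "middle_down": 12, "initial_down": 8},
    "UT64x8": {"up": 64, "final_down": 64, "middle_down": 48, "initial_down": 32},
}


def total_components(stage_counts):
    """Total routing components in one unit tree."""
    return sum(stage_counts.values())


def stack_side_inches(stage_counts, package_side=PACKAGE_SIDE_IN):
    """Side of a square board holding the most populous stage.

    Assumes a square grid with a gap equal to one package width
    between neighbours, i.e. a pitch of twice the package side.
    """
    largest = max(stage_counts.values())
    per_side = math.isqrt(largest - 1) + 1  # integer ceil(sqrt(largest))
    return per_side * 2 * package_side


def routing_layers_height_inches(stage_counts, layer_pitch=LAYER_PITCH_IN):
    """Height of the routing layers alone, without compression plates."""
    return len(stage_counts) * layer_pitch


if __name__ == "__main__":
    for name, stages in UNIT_TREE_STAGES.items():
        print(name, total_components(stages),
              round(stack_side_inches(stages), 1),
              round(routing_layers_height_inches(stages), 2))
```

Under these assumptions the two stacks come out at about 11.2 and 22.4 inches per side, consistent with "just under 1 foot" and "just under 2 feet", and the four routing layers come to roughly 1 inch.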

At the leaves, a single unit tree with $c_l = n_i = n_o$ is connected to each cluster of $r_p$ processors. To build the next size larger machine, we replicate the lower physical subtree $r_p$ times and build a new tree level out of enough unit trees to support the channels entering or leaving the roots of all of the lower physical subtrees. To quantify, if $C_{r_n}$ channels come out of each subtree with $n$ levels, the total number of unit trees required to form the root of the physical subtree at physical tree level $n + 1$ is given by:

$N_{n+1} = \frac{C_{r_n}}{c_l}$ (7.3)






In t urn, t h e ph ysical sub t ree root ed at ph ysical t reeTetelv ill h ave a t ot al ch annel capacit y 
out of it s root given b y; 

C’ra+l — r ‘ N n ^.\ (7.4) 

With the particular unit trees just introduced, we form the leaves of the tree by connecting one
$UT_{64\times 2}$ to each cluster of 64 processors. If we use $UT_{64\times 2}$ unit trees in the second level as well
as the first, we need $32/2 = 16$ $UT_{64\times 2}$ unit trees to form the root of each physical subtree rooted
in the second physical tree level. The subtree rooted in the second physical tree level supports
$64^2 = 4096$ nodes and includes 64 $UT_{64\times 2}$ unit trees in the first physical tree level. Alternately,
we can use $32/8 = 4$ $UT_{64\times 8}$ unit trees to compose the root of each subtree rooted in the second
physical tree level to support the same number of nodes and first-level unit trees. To build an even
larger machine, we can make 64 copies of a 4096-node tree and interconnect them using a third
physical tree level composed of $512/2 = 256$ $UT_{64\times 2}$ unit trees or $512/8 = 64$ $UT_{64\times 8}$ unit trees.

The resulting subtree rooted in the third physical tree level supports $64^3 = 262{,}144$ nodes and has
a total of $64 \cdot 16 = 1024$ $UT_{64\times 2}$ or $64 \cdot 4 = 256$ $UT_{64\times 8}$ unit trees composing the internal tree
nodes in the second physical tree level and 4096 $UT_{64\times 2}$ unit trees composing the nodes in the first
physical tree level.
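The counts in this example follow directly from Equations 7.3 and 7.4, and a short script makes the recurrence easy to check. The sketch below is illustrative only: the `UnitTree` class and `build_levels` function are hypothetical names, and it assumes (as in the example) that the first physical tree level is a single unit tree per cluster and that every higher level uses one unit-tree size.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class UnitTree:
    """Hypothetical record for a unit tree design."""
    name: str
    c_l: int  # channels per leaf-side connection
    c_r: int  # channels out of the root of one unit tree


UT_64x2 = UnitTree("UT_64x2", c_l=2, c_r=32)
UT_64x8 = UnitTree("UT_64x8", c_l=8, c_r=128)


def build_levels(leaf: UnitTree, upper: UnitTree, levels: int) -> List[Tuple[int, int]]:
    """Return (N_n, C_n) for physical tree levels 1..levels.

    N_n is the number of unit trees forming the root of a subtree at
    level n, and C_n is the channel capacity out of that root.
    """
    result = [(1, leaf.c_r)]  # level 1: one leaf unit tree per cluster
    for _ in range(levels - 1):
        c_prev = result[-1][1]
        if c_prev % upper.c_l != 0:
            raise ValueError("child capacity is not a multiple of c_l")
        n_next = c_prev // upper.c_l   # Equation 7.3
        c_next = upper.c_r * n_next    # Equation 7.4
        result.append((n_next, c_next))
    return result


if __name__ == "__main__":
    for upper in (UT_64x2, UT_64x8):
        print(upper.name, build_levels(UT_64x2, upper, 3))
```

Both choices of upper-level unit tree give the same root capacity at each level; only the number of stacks needed to provide it differs.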


7.5.5 Hollow Cube 


To physically organize the unit trees which make up a large fat tree in an efficient manner, we
must consider the interconnect topology. Each group of $r^3$ subtrees at physical tree level $n$ will
be connected to the subset of unit trees at physical tree level $n + 1$ which compose the root of the
subtree. If each of the subtrees at tree level $n$ is composed of $U$ unit trees and the parent tree level
is constructed from the same size unit trees, we know the parent set of unit trees will be composed
of

$$U_{parent} = \frac{c_r}{c_l} \cdot U$$

unit trees (See Equations 7.3 and 7.4). This gives us


$$U_{children} = r^3 \cdot U = r \cdot r^2 \cdot U \qquad (7.5)$$

unit trees at tree level $n$ connecting to

$$U_{parent} = \frac{c_r}{c_l} \cdot U = r^2 \cdot U \qquad (7.6)$$


unit trees at tree level $n + 1$. From these relations we see that the group of unit trees composing
the root of a physical subtree will generally be connected to $r$ times as many similarly sized unit
trees in the immediately lower physical tree level.

When $r = 4$, as in our examples from the previous section, a natural approach to accommodating
this 4:1 convergence ratio in our three-dimensional world is to build hollow cubes. We select one
cube face as the "top" of the cube and tile the unit trees composing the root of a physical subtree
in the plane across the top face of the cube. Together the four adjacent faces in the cube have four
times the surface area of the top and hence can be tiled with four times as many unit tree stacks.
The sides can house all of the unit trees composing the roots of the $r^3 = 64$ immediate children






Shown here is an example where the unit trees in both physical tree levels shown are the
same size. This is comparable to the case described in Section 7.5.4 where a 4096 processor
machine was built from two physical tree levels composed entirely of $UT_{64\times 2}$ unit trees.

Based on the stack sizes assumed in Section 7.5.4, this hollow cube would measure about
4 feet on each side.

Figure 7.10: Two Level Hollow-Cube Geometry


of the cube top. If the top is part of physical tree level $n$, the sides contain unit trees which are part
of physical tree level $n - 1$. We leave the "bottom" of the cube open to increase accessibility to the
cube's interior. Figures 7.10 and 7.11 show two hollow-cube arrangements for machines composed
of two physical tree levels. Figure 7.12 depicts the hollow-cube arrangement for a machine with
three physical tree levels.

The hollow-cube topology is optimized to expose the surfaces of stacks which interconnect
to each other. All of the interconnect between unit trees within a hollow-cube fat tree will occur
between the sides and face of some cube. The hollow-cube topology is a three-dimensional
fractal-like geometry which attempts to maximize the surface area exposed for interconnect within
a given volume.

7.5.6 Wiring Hollow Cubes 

Each of the unit tree stacks in the sides of a cube feeds connections to and from unit trees
composing the parent subtree in the top of the cube. All of the connections in and out of the top of a
unit tree stack are logically equivalent. These logically equivalent channels should be distributed
among the unit trees composing the parent subtree for fault tolerance. This fanout from a unit tree
to multiple unit trees in the parent subtree is desirable for the same reasons fanout from the dilated








Shown here is an example where the unit trees in the higher physical tree level are four
times as large as the ones in the lower physical tree level. This is comparable to the case
described in Section 7.5.4 where the lowest tree level was composed of $UT_{64\times 2}$ unit trees
while the higher level was composed of $UT_{64\times 8}$ unit trees. Like Figure 7.10, this hollow
cube measures about 4 feet on each side.

Figure 7.11: Two Level Hollow Cube with Top and Side Stacks of Different Sizes

connections of a single router is desirable (See Section 3.5.3). With proper fanout, entire unit tree
stacks can be removed from the non-leaf, physical tree levels, and the network still retains sufficient
connectivity to route all connections.

Wire connections are made through the center of each hollow cube using controlled-impedance
cables. The worst-case wire length between two physical tree levels is proportional to the length
of the side of the cube which the wire traverses. Any route through the network traverses a cube
of a given size at most twice, once on the path to the root and once on the path from the root to the
destination node.

7.5.7 Hollow Cube Support 

To support the unit trees making up a hollow cube, we build a gridded support substrate much
like the raised floors used in traditional computer rooms. Due to the physical size of the hollow
cubes, they occupy room-sized or building-sized structures. The structure to house the hollow-cube
network is built with these gridded walls and ceilings to accept the unit trees which are used as
building blocks. Conduits for power and coolant are accommodated along grid lines in the gridded
substrate. The sixth face of each hollow cube, vacant of unit trees, supplies access to the interior
and provides a location for cooling pumps and power supplies.







Shown above is a hollow cube containing three physical tree levels. If all the unit trees
were $UT_{64\times 2}$ unit trees, this structure would house 262,144 endpoints. Making the same
assumptions as in Figure 7.10, the central cube in this structure measures about 16 feet
along each side. The whole unit, as shown, would measure 24 feet along each side and be
16 feet tall.


Figure 7.12: Three Level Hollow Cube














Rather than directly connecting the wires into a unit tree to the unit tree itself, we can build a
wiring harness which mates with the unit tree. This harness collects all the wires connecting to a
single unit tree. The harness makes compressional contact with either the top or bottom of a unit
tree stack to connect the wires to the unit tree. This wiring harness simplifies the task of replacing
a unit tree stack. Without the harness it would be necessary to unplug all of the connections into
the outgoing unit tree and then reconnect them to the replacement. Since each unit tree generally
supports hundreds of connections, this operation would be involved and highly error prone.

7.5.8 Hollow Cube Limitations 

As introduced, hollow cubes are only well matched to radix four fat-trees and only retain many
of their nice properties up to three physical tree levels. For the first three tree levels, the cube
side lengths increase by a factor of four between tree levels. Since each successive tree level
accommodates 64 times as many processors, side length, and hence worst-case wire length, grows
as $\sqrt[3]{N}$. Starting at the fourth physical tree level, the need to accommodate space occupied by
lower tree levels increases the growth factor to six. As a result, worst-case wiring lengths grow
faster beyond this point. Also starting at the fourth physical tree level, the open faces of many of
the hollow cubes become blocked by other hollow cubes, limiting maintenance access.
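The growth rates described here can be tabulated with a few lines of code. This is a rough sketch under stated assumptions: the function names are hypothetical, the 1-foot baseline is the $UT_{64\times 2}$ stack size assumed in Section 7.5.4, and "level 1" is simply a single stack rather than a cube.

```python
def cube_side_feet(level: int, stack_side_ft: float = 1.0) -> float:
    """Approximate side length of the structure for a given number of
    physical tree levels.

    Side length grows by a factor of four per level through level three
    and by a factor of six for each level after that.
    """
    if level < 1:
        raise ValueError("level must be at least 1")
    side = stack_side_ft
    for n in range(2, level + 1):
        side *= 4 if n <= 3 else 6
    return side


def processors(level: int) -> int:
    """Each physical tree level supports 64 times as many processors."""
    return 64 ** level


if __name__ == "__main__":
    for level in range(1, 6):
        print(level, processors(level), cube_side_feet(level))
```

Through level three the side length tracks the cube root of the processor count (64 times the processors for 4 times the side); from level four on it grows faster than that.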

7.6 Multi-Chip Modules Prospects 

The Multi-Chip Module (MCM) is an emerging packaging technology that further improves
component packaging density by dispensing with the IC package. Bare die are bonded directly
to a high-performance substrate which serves to interconnect the die. The removal of the IC
package allows components to be situated more closely, lowering interconnect latency. Recall from
Section 7.2.1 that package size is proportional to the spacing of external i/o pins, not the die size.
Avoiding the package allows the component to take up space relative only to the die size.

Unfortunately, MCM technology has a number of drawbacks which relegate its use to small,
high-end systems today. Few IC manufacturers are in the practice of supplying bare, tested die.
Final, full-speed IC testing is generally done after the die is packaged. The facilities available for
full-scale testing of unpackaged die are limited. As a result, it is generally not possible to know
whether all of the die will work before assembling an MCM. Since component speed grading is
also generally only performed on packaged ICs, one has little knowledge of the yielded operational
speed for each IC. These drawbacks are compounded by the fact that repair and rework technology
for MCMs is in its infancy. The MCM technologies available today generally are not amenable
to die replacement. Consequently, stocked MCM yield is low. Additionally, NRE costs on MCM
substrates are comparable to silicon IC NRE costs rather than PCB NRE costs. The combination
of the fact that MCMs have yet to become a high-volume technology, the lower yield due to lack
of repairability, and high NRE costs makes MCM technology uneconomical for most designs at
present.

When MCMs become an economically viable technology, they may be able to replace packaged
ICs and PCBs. Stacks composed from layers of MCMs could be a factor of 3 to 4 smaller in each
of the planar dimensions than the stacks described with DSPGA372 style components. To build
MCM stacks, we would need the ability to connect signals into both sides of an MCM substrate.
If the MCM is limited to single-sided or peripheral i/o, the size of the MCM required to satisfy i/o
requirements alone may negate much of the density benefits. Unless significant advances are made
in MCM repair, MCM stacks would not have the repair advantages of the current stack packaging
scheme. The hollow-cube topology can be used to interconnect MCM stacks with most of the same
benefits and limitations.

7.7 Summary 

In this chapter we developed a three-dimensional stack packaging technology. Using dual-sided
pad-grid array IC packages, compressional board-to-package connectors, and conventional
PCBs, we developed a stack structure with alternating layers of components and PCBs. The
combination of DSPGA packages and compressional connectors served both to connect packaged
ICs to horizontal PCBs and to provide vertical board-to-board interconnect within the stack. We
demonstrated how to map multistage networks into this three-dimensional stack structure. We
then looked at the limitations on the size of stacks we can build. To accommodate larger systems,
we developed a network decomposition for fat-tree, multistage networks which allows us to build
large fat-tree networks from one or two primitive unit tree stack designs. We also showed how these
unit tree stacks can be arranged in a hollow-cube geometry to construct large fat-tree machines and
commented on the limitations of the hollow-cube structure.

7.8 Areas To Explore 

Many areas of packaging are quite fertile for exploration.

• We have pointed out the limitations with current MCM technology and suggested some
requirements necessary for the technology to provide real benefits.

• The hollow-cube topology has many nice properties up to its limitations. It would be useful
to find alternative topologies with a wider range of application.

• If free-space optical interconnect becomes a viable technology on this scale, the hollow cubes
can become truly hollow. Using free-space optical transmission across the long distances
through the cube, we could exploit the propagation rate of light to keep transit latencies
low. The distance across the larger hollow-cube stages is sufficiently large that the savings
due to higher propagation rate may make up for the latency associated with converting
electrical signals to light and back again. Recent work in optics promises to integrate
electrical and optical processing so that we will be able to build optical conversions into our
primitive routing elements [Mil91]. Since photons do not interfere with each other, free-space
optics makes the task of wiring the interconnections through the center of the hollow
cube trivial. [BJI+86] and [WBJ+87] discuss early work on large-scale, free-space optical
interconnect for VLSI systems. They use holographic optical elements to direct optical
beams for interconnections. The flexibility of the holographic media holds out promise for
adaptive and dynamic connection alignment and reconfiguration. At present, much work is
still needed on efficient conversion between electrical and optical signals and emitter-detector
alignment.





Part III 

Case Studies 





8. RN1 


RN1 is a circuit-switched, crossbar routing element developed in the MIT Transit Project [MDK91].
RN1 may be configured either as an 8-bit wide, radix-4, dilation-2 crossbar router (i.e. $i = 8$,
$r = 4$, $d = 2$, $w = 8$) or as a pair of independent, radix-4, dilation-1 routers (i.e. two routers
with $i = 4$, $r = 4$, $d = 1$, $w = 8$) (See Figure 8.1). In both configurations, RN1 supports the
basic routing protocol detailed in Section 4.5. RN1 has no internal pipelining. Each RN1 router
establishes connections and passes data with a single clock cycle of latency.

Figure 8.2 shows the micro-architecture for RN1. Each forward and backward port contains
a simple finite-state machine for maintaining connection state and processing protocol signalling.
The line control units keep track of available backward ports and handle random port selection.
Backward port arbitration occurs in a distributed fashion along each logical output column. When
several forward ports attempt to open a connection to the same logical backward port during the
same cycle, an 8-way arbitration for the available backward ports takes place. [Min91] contains a
detailed description of the design and implementation of RN1.

RN1 was implemented as a full-custom CMOS integrated circuit using a combination of
standard-cell and full-custom layout. Standard, five-volt, CMOS i/o pads were used with this
first-generation routing component. RN1 was fabricated in a Hewlett-Packard CMOS process
(CMOS34) through the MOSIS service. The RN1 die measures 1.2 cm on each side. The die size


[Figure panels: an 8x4 crossbar with a dilation of 2; two 4x4 crossbars, each with a dilation of 1]

Figure 8.1: RN1 Logical Configurations






















































[Figure: block diagram showing the forward ports, the backward ports (labeled in pairs such as 1A, 1B and 3A, 3B), the control units, and the crosspoint array]

Figure 8.2: RN1 Micro-architecture


was fully determined by the perimeter required to house 160 signal pads plus power and ground
connections in a single row of peripheral bonding pads. RN1 is packaged in a DSPGA372 package
as shown in Figure 8.3.

RN1 can support clock rates up to 50 MHz. Analysis of the critical-path timing indicates
that the standard, five-volt i/o pads and the standard clock buffer are key contributors limiting
clock frequency. The input and output latencies for the RN1 i/o pads are each roughly 10 ns
(i.e. $t_{io} = 20$ ns). Inside the i/o pads, the latency through the IC logic is around 14 ns
($t_{switch} = 14$ ns).







































































































RN1 is housed in the DSPGA372 package introduced in Section 7.3.1.


Figure 8.3: Packaged RN1 IC
































9. Metro 


The Multipath Enhanced Transit Routing Organization (METRO) is an architecture for second
generation Transit routing components. The METRO architecture encompasses the MRP-ROUTER
protocol described in Chapter 4 including the enhancements described in Section 4.9. In addition
to the basic router protocol implemented by RN1, METRO includes multi-TAP scan, port-by-port
deselection, partial-external scan, width cascading, fast path reclamation, and pipelining provisions.

9.1 METRO Architectural Options

The METRO architecture encompasses a large space of routing component configurations and
router behavior. Some architectural parameters are fixed when constructing a particular routing
component. Any particular router will have a fixed data width ($w$), a fixed number of inputs and
outputs ($i$, $o$), a fixed number of pipeline delays routing data through the router ($dps$), a fixed
number of header words swallowed during connection establishment ($hw$), and a fixed number of
scan paths ($sp$). Each particular router will have configuration options, accessible via the TAP,
which allow one to choose among a set of possible router behaviors. One can configure any METRO
routing component to act as a radix-$r$, dilation-$d$ router ($o = r \times d$) by setting the effective dilation.
Each particular router will have a maximum limit on the dilation setting ($max\_d$). Each forward
and backward port on any METRO router can be enabled or disabled (See Section 4.9.1). It is
also possible to configure each forward and backward port on a METRO routing component to
accommodate the pipeline delay cycles associated with pipelining data on the wires between routers
(See Section 4.11.3). The configured value, $vtd$, defines the number of cycles which will transpire
between the time when a TURN passes through a port and the time when the first piece of return
data arrives. Each particular routing component will allow $vtd$ to take on any value up to some
component-specific maximum, $max\_vtd$. Each forward and backward port can also be configured to
either use fast path reclamation or detailed connection shutdown (See Section 4.9.2). Table 9.1
summarizes the architectural variables which must be selected during the construction of a routing
component. Table 9.2 summarizes the configuration options which are available on a METRO
routing component.

9.2 METRO Technology Projections

Based on our experience with RN1 and the signalling technology described in Chapter 6, we
believe we can build a comparably sized METRO router in Hewlett-Packard's 0.8 µm effective
gate-length CMOS process (CMOS26) which operates at clock frequencies up to 200 MHz. As noted in the
previous section, the critical path for connection establishment in RN1 was under 14 ns. Through
a combination of technology scaling and clever circuit techniques, we can reduce this allocation
latency to 5 to 10 ns. The flow-through latency on RN1 was much less than the allocation latency.
Routing data between a forward port and backward port of an open connection with less than 5 ns
of latency should be quite manageable. Consequently, we expect a part operating at 200 MHz can





Variable   Function                                      Range

sp         Number of Scan Paths                          sp ≥ 1
w          Bit Width of Data Channel                     w ≥ max_d
max_d      Maximum Dilation                              max_d = 2^n for some n ≥ 0; max_d ≤ o
i          Number of Forward Ports                       i = 2^n for some n ≥ 0
o          Number of Backward Ports                      o = 2^n for some n ≥ 0; o ≥ max_d
ri         Number of Random Input Bits                   ri ≥ 1
hw         Number of Header Words Consumed Per Router    hw ≥ 0
dps        Number of Data Pipestages Through Router      dps ≥ 1
max_vtd    Maximum Number of Delay Slots Available       max_vtd ≥ 0
           for Variable Turn Delay


This table summarizes the architectural variables which distinguish any particular
routing component.


Table 9.1: METRO Architectural Variables
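The constraints on the architectural variables can be expressed as a small consistency check. The sketch below is an illustration only: `MetroParams`, `problems`, and `radix` are hypothetical names, the inequalities are read as "greater than or equal" (an assumption, since the printed relations are hard to make out), and only the constraints that seem unambiguous are checked. The `radix` helper assumes the relation o = r × d between backward ports, radix, and effective dilation.

```python
from dataclasses import dataclass
from typing import List


def is_power_of_two(x: int) -> bool:
    """True for 1, 2, 4, 8, ..."""
    return x >= 1 and (x & (x - 1)) == 0


@dataclass(frozen=True)
class MetroParams:
    """Hypothetical container for the Table 9.1 variables."""
    sp: int       # scan paths
    w: int        # data channel width in bits
    max_d: int    # maximum dilation
    i: int        # forward ports
    o: int        # backward ports
    ri: int       # random input bits
    hw: int       # header words consumed per router
    dps: int      # data pipestages through the router
    max_vtd: int  # maximum variable turn delay slots

    def problems(self) -> List[str]:
        """Return a list of violated constraints (empty if consistent)."""
        found = []
        if self.sp < 1:
            found.append("sp must be at least 1")
        for name in ("max_d", "i", "o"):
            if not is_power_of_two(getattr(self, name)):
                found.append(f"{name} must be a power of two")
        if self.max_d > self.o:
            found.append("max_d must not exceed o")
        if self.ri < 1:
            found.append("ri must be at least 1")
        if self.hw < 0:
            found.append("hw must be non-negative")
        if self.max_vtd < 0:
            found.append("max_vtd must be non-negative")
        return found

    def radix(self, d: int) -> int:
        """Radix obtained when the effective dilation is set to d."""
        if not is_power_of_two(d) or d > self.max_d:
            raise ValueError("d must be a power of two no larger than max_d")
        return self.o // d


if __name__ == "__main__":
    part = MetroParams(sp=2, w=8, max_d=2, i=8, o=8, ri=1, hw=1, dps=1, max_vtd=4)
    print(part.problems(), part.radix(2), part.radix(1))
```

With eight backward ports, setting the effective dilation to 2 yields a radix-4 router, matching the RN1-style configuration described in Chapter 8.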


Option                           One for Each   Number of Instances   Bits Each

Dilation (d)                     component      1                     $\lfloor \log_2(\log_2(max\_d)) \rfloor + 1$
Port (De)select                  port           i + o                 1
Deselected Port Drives Output    port           i + o                 1
Fast Reclamation                 port           i + o                 1
Turn Delay (vtd)                 port           i + o                 [...]


Each METRO router has several configuration options which control its behavior. This table
summarizes the options common to all METRO routing components.


Table 9.2: METRO Router Configuration Options


be built with one cycle of data latency ($dps = 1$) and one or two cycles of connection establishment
latency ($hw = 0$ or $hw = 1$). Using the i/o pads detailed in Chapter 6, the delay through a pair of
i/o pads is 3 ns. As long as the propagation delay along the interconnect between them is under 2 ns,
the i/o pads and interconnect serve as a single pipeline stage. With conventional PCB ($\epsilon_r \approx 4$), wire runs
up to 30 cm in length can be traversed in a single 200 MHz clock cycle.
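The 30 cm figure can be reproduced from the cycle time, the pad delay, and the signal velocity on the board. The following is a back-of-the-envelope sketch, not a signal-integrity model: the function name is hypothetical, and the relative dielectric constant of 4 is an assumed typical value for conventional PCB material.

```python
import math

# Speed of light in vacuum, expressed in centimeters per nanosecond.
C_CM_PER_NS = 29.9792458


def max_wire_length_cm(clock_mhz: float, pad_delay_ns: float, eps_r: float) -> float:
    """Longest wire that fits in one clock cycle alongside the i/o pad delay.

    The signal velocity is taken as c / sqrt(eps_r); whatever is left of
    the cycle after the pad delay is the budget for wire propagation.
    """
    cycle_ns = 1000.0 / clock_mhz
    wire_budget_ns = cycle_ns - pad_delay_ns
    if wire_budget_ns <= 0:
        return 0.0
    velocity_cm_per_ns = C_CM_PER_NS / math.sqrt(eps_r)
    return wire_budget_ns * velocity_cm_per_ns


if __name__ == "__main__":
    # 200 MHz clock, 3 ns through a pair of i/o pads, assumed eps_r of 4.
    print(round(max_wire_length_cm(200.0, 3.0, 4.0), 1))
```

At 200 MHz the cycle is 5 ns, leaving 2 ns for the wire; at roughly 15 cm/ns that is about 30 cm.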






























































10. Modular Bootstrapping Transit Architecture (MBTA) 


The Modular Bootstrapping Transit Architecture (MBTA) is a series of small multiprocessors based
around multistage routing networks composed of RN1 or METRO routing components. MBTA
integrates a number of minimal processing nodes with a multistage network organized as described
in Section 3.5.

10.1 Architecture 

Figure 10.1 shows the network used for a 64-processor MBTA machine. Each processing node
has two network inputs and two network outputs ($n_i = n_o = 2$) for fault tolerance. The network
shown is composed of RN1-style routing components and uses the dilation-1 router configuration
in the final stage so that two different routing components may provide network outputs from
the network to each processing node. Since RN1 is a radix-4 routing component, the network is
composed of $\log_4(64) = 3$ routing stages.

Figure 10.2 shows the architecture of the MBTA processing nodes. Each node is composed
of a RISC microprocessor (e.g. Intel's 80960CA [MB88] [Int89]), fast, static memory, network
interfaces, and support logic. Four logical network interfaces service the two connections into and
the two connections out of the network. The processor performs computation, initiates network
communications, and services non-primitive network operations. The processor is also responsible
for the highest levels of MRP-ENDPOINT, which are not handled by the network interface. A single,
high-speed memory bank serves to hold instructions and data for the processor, store data coming
and going from the network, and store connection status information. The basic node architecture
also has provisions to support co-processors and alternate forms of memory. In order to interface
MBTA machines with existing computers and data networks, there are provisions for some nodes
to accommodate external interfaces.

10.2 Performance 

The MBTA architecture has been balanced to support byte-wide network connections running
at 100 MHz. The network interfaces send data from the fast, static memory and receive data
into the memory, as well. Consequently, each network interface requires 100 megabytes/second
(100 MB/s) of bandwidth into memory during sustained data transfers. The processor is running
at 25 MHz and may read up to one word, or four bytes, per cycle during burst memory operations.
To prevent the processor from stalling, it, too, needs 100 MB/s of bandwidth into memory. To run
all network interfaces and the processor simultaneously at full speed, we would need 500 MB/s of
bandwidth into memory. To simplify the problem, we restrict operation so that only one network
input may be feeding data into the network at a time. This restriction limits the contention in the
network while giving us the fault tolerance benefits of having two connections into the network.
To provide the 400 MB/s of bandwidth required, we use 64-bit wide, 20 ns, synchronous SRAM













[Figure labels: Address Bus; Data Bus (64); Network Ports (byte wide + control)]


Shown above is the architecture for each MBTA node. The units outside of the dotted box
are common to all MBTA nodes. Within an MBTA machine, a few nodes would support
external interfaces.


Figure 10.2: MBTA Node Architecture



















for the high-speed memory on a pipelined bus. Each of the four units using the memory gets the
opportunity to read or write one 8-byte value to or from memory every 80 ns. This allows each
unit to sustain 100 MB/s data transfers without internal buffering as long as data can be transferred
as contiguous double-words.
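The memory bandwidth budget works out as simple arithmetic, sketched below. The function name and the constants' names are hypothetical; the figures (8-byte accesses, a 20 ns memory cycle, four units sharing the bus) come from the text, and MB is taken to mean 10^6 bytes.

```python
def bandwidth_mb_per_s(bytes_per_access: int, period_ns: float) -> float:
    """Sustained bandwidth in MB/s (10**6 bytes per second) for one access
    of the given size every period_ns nanoseconds."""
    return bytes_per_access * 1000.0 / period_ns


BUS_WIDTH_BYTES = 8        # 64-bit wide memory
MEMORY_CYCLE_NS = 20.0     # 20 ns synchronous SRAM
UNITS_SHARING_BUS = 4      # processor plus three active network interfaces

total = bandwidth_mb_per_s(BUS_WIDTH_BYTES, MEMORY_CYCLE_NS)
per_unit = bandwidth_mb_per_s(BUS_WIDTH_BYTES, MEMORY_CYCLE_NS * UNITS_SHARING_BUS)

if __name__ == "__main__":
    print(total, per_unit)
```

Dividing the bus into four round-robin slots gives each unit one 8-byte access every 80 ns, which is where the 100 MB/s per-unit figure comes from.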

If all nodes are busy sending data at 100 MB/s, a 64-processor network, like the one shown
in Figure 10.1, can support a peak bandwidth of 6,400 MB/s = 6.4 GB/s. With one network input
and both network outputs in operation, a single node can simultaneously transfer up to 300 MB/s.
Running the METRO component described in Section 9.2 at 100 MHz, it takes one cycle to traverse
each router and one cycle to traverse each wire in the network. The unloaded latency through
the network ($T_{unloaded}$) is 70 ns, arising from 10 ns of latency through each of the three routing
components in any path through the network and 10 ns of latency through each of the four chip
crossings between network endpoints. If our technology projections for METRO hold, we could
implement a version with RN1-style pipelining and cut this latency in half. Alternately, if we could
cycle the pipelined memory bus twice as fast or increase the memory width to 128 bits and require
16-byte data transfers, we could support 200 MB/s network connections utilizing the METRO router at
full speed. This would cut the unloaded network latency in half to 35 ns. This change would also
double the bandwidth figures above and cut the transmission time ($T_{transmit}$) in half. For this size
of a network, the total time to communicate a message from one node to another will
be dominated by the network input and output latencies and the transmission latency,
$T_{transmit}$.
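The stage count, unloaded latency, and peak bandwidth quoted above can be checked with a short calculation. This is a sketch with hypothetical function names; it assumes one cycle per router and one cycle per chip crossing, with one more crossing than there are routing stages, as in the 64-processor example.

```python
def routing_stages(endpoints: int, radix: int) -> int:
    """Number of radix-`radix` stages needed to reach `endpoints` nodes."""
    stages, reach = 0, 1
    while reach < endpoints:
        reach *= radix
        stages += 1
    return stages


def unloaded_latency_ns(stages: int, clock_mhz: float,
                        router_cycles: int = 1, wire_cycles: int = 1) -> float:
    """Latency through an idle network: each router and each chip
    crossing (stages + 1 of them) costs a whole number of clock cycles."""
    cycle_ns = 1000.0 / clock_mhz
    crossings = stages + 1
    return (stages * router_cycles + crossings * wire_cycles) * cycle_ns


def peak_bandwidth_mb_per_s(nodes: int, per_node_mb_per_s: float) -> float:
    """Aggregate bandwidth when every node sends at its full rate."""
    return nodes * per_node_mb_per_s


if __name__ == "__main__":
    stages = routing_stages(64, 4)
    print(stages,
          unloaded_latency_ns(stages, 100.0),
          unloaded_latency_ns(stages, 200.0),
          peak_bandwidth_mb_per_s(64, 100.0))
```

Doubling the clock to 200 MHz halves every term, which is why the unloaded latency drops from 70 ns to 35 ns.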





11. Metro Link 


METRO Link (MLINK) is a network interface designed to connect the processor and memory on
an MBTA node to a METRO network. MLINK handles the core portions of MRP-ENDPOINT (See
Section 4.7) and provides support so the node processor can handle the remainder.

11.1 MLINK Function

MLINK performs all of the low-level operations necessary for an endpoint to send and receive
data over a METRO network. MLINK handles control and signalling which must operate at the
network speed. It also handles those operations which must be implemented in hardware to exploit
the full bandwidth of the network ports and keep end-to-end network latency low. MLINK leaves
infrequent diagnostic operations, certain kinds of message formatting, and policy decisions to the
node processor.

MLINK's primary function is to convey data between the network and memory. MLINK
moves data between the double-word wide memory bus, on which it gets one cycle once every
80 ns, and the byte-wide network port operating at 100 MHz. MLINK adds control bytes to the
data stream (e.g. ROUTE, TURN, DROP) to open, reverse, and close network connections. MLINK
also generates and verifies the end-to-end message checksums used to guard message transmission.

MLINK will retry failed connections without processor intervention up to some processor-specified
number of trials. A pair of MLINK network inputs will arbitrate using randomization to determine
which input is used for each connection. MLINK network outputs can handle the reception of
a small set of primitive messages (See Section 11.3) without processor intervention. For all other
messages, MLINK queues the incoming data to be handled by the processor.
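The retry behavior can be pictured with a small software model. This is only a sketch of the idea, not the hardware's actual mechanism: `send_with_retries` and the `attempt` callback are hypothetical, and re-drawing the random input choice on every trial is an assumption (the text says only that the pair of inputs arbitrates using randomization for each connection).

```python
import random
from typing import Callable, Tuple


def send_with_retries(attempt: Callable[[int], bool],
                      max_trials: int,
                      rng: random.Random) -> Tuple[bool, int]:
    """Try to open a connection up to max_trials times.

    `attempt(port)` stands in for one connection attempt through network
    input `port` (0 or 1) and returns True on success. Returns a pair
    (succeeded, trials_used); on failure the caller (the node processor
    in the text) decides how to proceed.
    """
    for trial in range(1, max_trials + 1):
        port = rng.randrange(2)  # randomized choice between the two inputs
        if attempt(port):
            return True, trial
    return False, max_trials


if __name__ == "__main__":
    rng = random.Random(1993)
    # Toy network in which each attempt is blocked half of the time.
    outcome = send_with_retries(lambda port: rng.random() < 0.5, 8, rng)
    print(outcome)
```

Only when the trial budget is exhausted does the failure surface to the processor, which keeps the common retry path entirely in the interface.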

Operations which are more complicated and infrequent are left to the node processor. The
processor is responsible for packet launch and for source-queuing of messages while the network
input is busy. The processor determines how to proceed when MLINK fails to deliver a message in
the specified number of trials. The processor is also responsible for allocating space for incoming
remote function invocation messages and for processing and dequeuing the messages as they arrive.

When configured to do so, MLINK will store connection status information for successful and
failed connections. This information includes the status and checksum words returned from each
router in an allocated path. MLINK leaves the task of interpreting this information to the processor.

A few tasks are also left to the processor in order to limit the amount MLINK needs to know about
the message protocol or attached network. The remote function invocation provides a
flexible opportunity to customize low-overhead messages for a particular application. MLINK only
provides basic transport and queuing of these message types, leaving formatting and interpretation
to the processor. MLINK also leaves the selection of routing words to the processor. This allows
the processor to select a particular path in extra-stage, multipath networks (See Section 4.7.1) and
prevents MLINK from needing to know the details of formatting routing words for any particular
network (See Section 4.6).





11.2 Interfaces 


On the node side, MLINK connects to the 64-bit, pipelined bus. This bus serves two purposes for
each MLINK interface. Recall from Chapter 10 that each network interface on an MBTA node has
a designated cycle on the pipelined bus once every 80 ns. MLINK uses this slot to read or write data
from the fast memory at 100MB/s. The processor also has a designated slot on the pipelined bus.
During the processor slot, it may read or write 32-bit values at memory-mapped MLINK addresses.

These memory-mapped addresses allow the processor to:

1. configure each MLINK

2. launch or abort network operations

3. check on the status of each MLINK's ongoing or recently completed operations

On the network side, MLINK has a byte-wide network port which behaves like a METRO forward
or backward port. The network port has the same configuration options as each METRO
forward and backward port (See Table 9.2).

11.3 Primitive Network Operations 

MLINK distinguishes five kinds of primitive network operations:

1. READ

2. WRITE

3. RESET

4. NOOP

5. ROP (remote function invocation)

The RESET, NOOP, READ, and WRITE operations are handled entirely by the receiving MLINK without
involving the processor, whereas the remote function invocation is only queued by MLINK to be
handled by the node processor. The READ operation performs a multi-word, memory read operation
on the remote node, returning the data at the specified address to the source. The WRITE operation
performs the complementary function, allowing data to be written into a remote node's memory.
These operations are direct, hardware reads and writes and are associated with no guards for
coherence. The RESET operation signals MLINK to release the associated processor and allow it to
boot. Combined with the WRITE operation, this primitive allows each node to be remotely configured
and booted across the network. The NOOP operation performs no function on the destination node
but does return connection status information which is useful during testing.

Remote function invocation is a generic primitive which allows software configuration of
arbitrary message types and remote network functions. MLINK simply conveys the specified data
and a distinguished address from the source endpoint to the destination via the network. The
destination MLINK queues the arriving data and address on the incoming message queue for the





MLINK Message Formats:

(ROUTE)* o RESET o (DATA_checksum)^2 o TURN
(ROUTE)* o NOOP o (DATA_checksum)^2 o TURN
(ROUTE)* o READ o len o (DATA_addr)^3 o (DATA_checksum)^2 o TURN
(ROUTE)* o WRITE o len o (DATA_addr)^3 o (DATA_checksum)^2 o (DATA_write)^(8*len) o (DATA_checksum)^2 o TURN
(ROUTE)* o ROP o len o (DATA_addr)^3 o (DATA_checksum)^2 o (DATA)^(8*len) o (DATA_checksum)^2 o TURN

MLINK Reply Formats:

(DATA_mlink_status) o ACK/NACK o DROP
(DATA_mlink_status) o (DATA-IDLE)* o (DATA_read)^(8*len) o (DATA_checksum)^2 o DROP


Each primitive message type has its own initial message format. Where determined by
MLINK, superscripts indicate the number of bytes composing each portion of the message
data. The read operation is the only primitive message which receives data along with its
reply message.


Figure 11.1: MLINK Message Formats


destination processor to service. The destination processor dequeues each message and invokes the
function at the specified address with the associated data. [vECGS92] calls this kind of low-overhead,
remote code invocation an Active Message. This primitive exports the basic functionality of the
network to the software level where custom message handlers can be crafted in software for each
application or run-time system.
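
The division of labor among the five primitives can be sketched in a few lines of Python. This is
an illustrative software model only, not the MLINK hardware. The class name `MlinkModel`, the
dictionary used as word-addressed memory, and the handler table passed to `service` are all
assumptions made for the example.

```python
from collections import deque

class MlinkModel:
    """Toy model of a receiving MLINK; every name here is hypothetical."""

    def __init__(self):
        self.memory = {}            # word address -> value (stand-in for node memory)
        self.rop_queue = deque()    # incoming remote function invocation messages
        self.processor_released = False

    def receive(self, op, addr=0, data=()):
        """Handle one incoming primitive and return the reply payload."""
        if op == "READ":
            (length,) = data        # for READ, data carries only the word count
            return [self.memory.get(addr + i, 0) for i in range(length)]
        if op == "WRITE":
            for i, word in enumerate(data):
                self.memory[addr + i] = word
            return "ACK"
        if op == "RESET":
            self.processor_released = True   # let the processor boot
            return "ACK"
        if op == "NOOP":
            return "ACK"                     # nothing happens beyond the reply
        if op == "ROP":
            # The interface only queues; interpretation is left to the processor.
            self.rop_queue.append((addr, tuple(data)))
            return "ACK"
        return "NACK"

    def service(self, handlers):
        """Processor side: dequeue each ROP and invoke the handler at its address."""
        results = []
        while self.rop_queue:
            addr, data = self.rop_queue.popleft()
            results.append(handlers[addr](*data))
        return results
```

In this sketch a remote boot is a sequence of WRITE messages followed by one RESET, and an
Active Message is an ROP whose address selects the handler that `service` later calls.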

Figure 11.1 summarizes the message formats used by MLINK network interfaces to perform the
primitive network operations. Where appropriate, the target address and data length are guarded
with their own checksum so that data can be written into memory while it is being received (See
Section 4.7.4). In reply to a WRITE, RESET, NOOP, or ROP message, MLINK sends status information
and an acknowledgement. When replying to an uncorrupted READ operation, MLINK returns the data
associated with the read. The DATA-IDLE words preceding the read data in the read reply message
are used to fill in any delays before the first byte of read data is available. This delay arises partially
from the need to wait for read data on the node data bus and partially from the need to fill
pipeline stages within MLINK with reply data.
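
As a rough guide to message sizes, the byte counts in Figure 11.1 can be added up mechanically.
The sketch below assumes that each ROUTE, operation, len, and TURN word occupies one byte on the
byte-wide port and that len counts 8-byte units of payload. Both are assumptions of this example;
the address, checksum, and payload sizes are the superscripts shown in the figure.

```python
def message_bytes(op, n_route, length=0):
    """Bytes in an initial message laid out as in Figure 11.1.

    One byte per ROUTE, operation, len, and TURN word is assumed here.
    """
    ADDR, CHECKSUM, UNIT = 3, 2, 8
    if op in ("RESET", "NOOP"):
        body = 1 + CHECKSUM                      # op + checksum
    elif op == "READ":
        body = 1 + 1 + ADDR + CHECKSUM           # op + len + addr + checksum
    elif op in ("WRITE", "ROP"):
        # READ-style header, then 8*len payload bytes and a second checksum
        body = 1 + 1 + ADDR + CHECKSUM + UNIT * length + CHECKSUM
    else:
        raise ValueError(f"unknown operation: {op}")
    return n_route + body + 1                    # routing words + body + TURN
```

Under these assumptions, a WRITE of two 8-byte units through three routing words occupies
`message_bytes("WRITE", 3, 2)` bytes on the wire, of which only 16 are payload.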

These primitives form a minimal set of network primitives. They provide a high degree of
flexibility for a general-purpose multiprocessor. In situations where the application and programming
model are limited or biased to a particular domain, it may make sense to customize a network
interface with additional network primitives implemented directly in hardware without the processor's





intervention. For instance, when building a dedicated, shared-memory machine with a particular
memory-model in mind, it would be beneficial to provide primitive network operations to handle
coherent memory transactions.





12. MBTA Packaging 


In Section 7.4, we showed how to package a network using the stack packaging technology
introduced in Chapter 7. Here, we consider packaging an entire 64-processor MBTA machine.

12.1 Network Packaging 

Packaging the network is a simple application of the network to stack packaging mapping
introduced in Section 7.4. As shown in Figure 10.1, a 64-processor network built out of RN1
components, or comparably sized METRO components, has 16 routers in each stage. We arrange
these routers in a 4 x 4 grid arrangement as shown in Figure 12.1. With the routers packaged in
DSPGA packages and placed on 3 inch centers, each routing board is roughly 12 inches square.

12.2 Node Packaging 

We can package the nodes inside the same stack by housing the larger, VLSI components in
DSPGA packages and using gull-wing surface-mount components for memory and bus logic. By
sharing the i/o pads associated with the 64-bit, node data bus, we can integrate all four logical
network interfaces on a single die and place the die in a DSPGA package. The processor and
custom bus control logic can each be placed in their own DSPGA package. The memory can be
obtained in gull-wing, surface-mount packages. The bus interface logic can be packaged in SSOP
packages with a 25 mil pad pitch [Tex91]. By adding a fourth DSPGA package, we can package a
node on a 6 inch square PCB with the DSPGA components centered 3 inches apart. The memory
and glue logic can be placed on the surface of the node PCB between the DSPGA packages as
shown in Figure 12.2. The fourth DSPGA package can be used either to house additional node
logic or as a blank for mechanical support and vertical signal continuity. This arrangement allows
us to stack four node PCBs on top of the network routing boards and align the DSPGA packages in
the network and nodes (See Figures 12.3 and 12.1).

To accommodate additional logic or memory for each node, we can build daughter boards of
the same size and use vertical connectivity to interconnect the boards. As long as the signals which
connect to the additional logic or memory are available on the pads of one of the four DSPGA
packages on the node, an adjacent PCB has access to these signals. A DRAM memory card, for
example, could be built by housing the DRAM controller in a DSPGA package, and packaging the
DRAM in TSOP packages between the DSPGA component sites. Blank DSPGA packages will be
necessary in any unused DSPGA grid sites.

12.3 Signal Connectivity 

Each node needs to be connected to two network input channels and two network output
channels. We use the through vias on the DSPGA packages to vertically connect each node into





[Figure 12.1 drawing: routing board, approximately 12 inches on a side]

The 16 routers in each stage of the network (See Figure 10.1) are arranged in a 4 x 4 grid.
The routers are housed in DSPGA packages and spaced 3 inches apart.


Figure 12.1: Routing Board Arrangement for 64-processor Machine


















































































































































[Figure 12.2 drawing: packaged node; legible component label: 128Kx8 SRAM]

By housing the processor, network interface, and bus control logic in DSPGA packages,
we can construct a 6 inch square node suitable for stack packaging. The fourth DSPGA
can be blank or house additional logic, such as an optional co-processor. Memory and bus
interface logic are housed in gull-wing surface-mount packages in the space between the
DSPGA packages.


Figure 12.2: Packaged MBTA Node









































Four nodes can be arranged in one 12 inch square stack layer which mates mechanically
and electrically with the routing component layers (Figure 12.1).


Figure 12.3: Layer of Packaged Nodes
































































































































































































the network layers. As shown in Figure 12.3, we can place four nodes in each stack layer above,
or below, the group of three network boards. To house 64 nodes, we need 16 such layers
of nodes. We can place half of the nodes above and half below the network layer to minimize the
distance of any node from the network. This segregation leaves us with eight layers of nodes on
each side of the network. The vertical through signals on each node must run network connections
for the eight nodes in each node column on each side of the network. Each of the eight nodes taps
off the appropriate subset of these signals to connect into the network. Mechanically, the node
arrangement described has a rotational symmetry of four. With proper signal arrangement, we can
exploit this symmetry to allow a single node PCB design to tap into any of four different vertical
signal runs. We can tap into the eight different vertical signal runs using only two different basic
node designs. In the network layers, the vertical through interconnect will be used to arrange the
network inputs and outputs so that half of the inputs and half of the outputs are available on each
side of the network.
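
The layer and design counts above follow from simple arithmetic, sketched here in Python. The
function name and the particular assignment of designs and rotations to layers are hypothetical;
the text fixes only the totals (16 layers, eight per side, two PCB designs, four rotations).

```python
def stack_plan(nodes=64, nodes_per_layer=4, rotations=4):
    """Derive layer counts for the node stack described in Section 12.3."""
    layers = nodes // nodes_per_layer        # 64 / 4 = 16 node layers
    per_side = layers // 2                   # half above the network, half below
    designs = -(-per_side // rotations)      # ceiling division: PCB designs needed
    # Hypothetical assignment: layer k on one side uses design k // rotations,
    # turned by (k % rotations) quarter turns, to select vertical signal run k.
    taps = [(k // rotations, k % rotations) for k in range(per_side)]
    return layers, per_side, designs, taps
```

With the default arguments the eight (design, rotation) pairs are all distinct, which is the
property that lets each node in a column tap a different vertical signal run.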

12.4 Assembled Stack 

Figure 12.4 shows an exploded view of the packaged 64-processor machine. External interfaces
mate with the nodes in the top-most node layer using the same vertical interconnect scheme
suggested for node daughter boards. The complete stack houses the network and all 64 nodes in a
cubic structure roughly 12" x 12" x 5" (See Figure 12.5).






A complete 64-processor machine stack is composed of 3 network layers (Figure 12.1) and
16 node layers (Figure 12.3). Two different node PCB designs, coupled with the rotational
symmetry of the node PCBs, allow each of the eight nodes in a vertical column to tap into
different network connections.


Figure 12.4: Exploded Side View of 64-processor Machine Stack




[Scale drawing; dimension label: 16"]

Figure 12.5: Side View of 64-processor Machine Stack





Part IV

Conclusion 





13. Summary and Conclusion 


We have examined the latency and fault tolerance associated with large, multiprocessor networks.
We developed techniques at many levels for building low-latency networks. We also developed
networks capable of sustaining faults and techniques allowing proper operation of these networks
in the presence of faults. In the development, we found no inherent incompatibilities between our
goals of low latency and fault tolerance. Rather, we found commonality between techniques which
decrease latency and those which improve fault tolerance.

Consequently, we were able to identify a rich class of networks with good latency and fault-tolerant
characteristics. We parameterized the networks in this class in several ways. We developed
an understanding of how the network parameters affect network properties. This understanding
allows us to tailor networks to meet the requirements of particular applications.


13.1 Latency Review 


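Combining the latency contributions from Section 2.4 and collapsing into a single equation, we
get:

\[
T_{net} = \Upsilon(application, topology) \cdot \left( s_n \cdot (t_{io} + t_{switch})
    + \sum_i \left\lceil \frac{d_i / v}{t_c} \right\rceil \cdot t_c
    + \left\lceil \frac{L}{w} \right\rceil \cdot t_c \right)
\tag{13.1}
\]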


We see that there are many aspects which contribute to network latency. To achieve low latency,
we must pay attention to all potential latency contributors and work to simultaneously minimize
their effects. In Chapters 3 through 7, we addressed all of these latency components and examined
ways to minimize their contributions.

We considered how to minimize the transit time between routers (T_t).

\[
T_t = \sum_i \frac{d_i}{v}
\tag{13.2}
\]

This latency is determined by the speed of propagation, v, and the total distance traversed, \sum_i d_i.

We saw that the maximum speed of signal propagation was determined by material properties.

\[
v = \frac{1}{\sqrt{\epsilon \mu}}
\]
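
To give a feel for the magnitudes involved, the following sketch evaluates the propagation speed
and the resulting wire transit time. It uses the standard relation v = c / sqrt(eps_r * mu_r) for a
medium with relative permittivity eps_r and relative permeability mu_r. The function names and
the sample material value in the usage note are illustrative assumptions, not figures from this
report.

```python
import math

C = 299_792_458.0  # speed of light in vacuum, m/s

def propagation_speed(eps_r, mu_r=1.0):
    """Maximum signal speed in m/s: c / sqrt(eps_r * mu_r)."""
    return C / math.sqrt(eps_r * mu_r)

def transit_time_ns(distance_m, eps_r, mu_r=1.0):
    """Time in ns for a signal to cross distance_m of interconnect."""
    return distance_m / propagation_speed(eps_r, mu_r) * 1e9
```

For example, with an assumed eps_r of 4.0 the signal travels at half the speed of light, so
`transit_time_ns(0.3048, 4.0)` gives roughly 2 ns for a 12 inch trace.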

We also saw that this maximum was only achievable with proper signal termination. In Chapter 6,
we saw signalling techniques for achieving this maximum rate of propagation with minimal power
dissipation. We saw that the traversed interconnect distance, d_i, depends on the growth
characteristics of the network topology and the achievable packaging density. In Chapter 3, we looked at
the interconnect distance growth characteristics for a large class of networks and identified those





with the most favorable growth characteristics. In Section 2.4.2, we noted that, in some situations,
locality can be exploited to minimize, on average, the distances which must be traveled inside
the network. Consequently, in Chapter 3 we also identified network topologies with locality. In
Chapter 7, we looked at technologies for high-density packaging. We also looked at topologies for
mapping networks onto the packaging technology in a way that exploits the density to minimize
interconnection distances.

We considered how to minimize the total number of routers which must be traversed in a
network, s_n. In Chapter 3, we found that log structured sorting networks gave us the lowest number
of switches as long as we restricted ourselves to bounded degree switching nodes (Section 2.7.1).
We also noted that the router radix, r, gives us a parameter we can use to control the actual number
of switches traversed in an implementation. Again, for some applications locality exploitation
may allow us to further reduce the average number of switches traversed when routing through the
network.

We noted that the latency contributed by each routing component was composed from the
switching time and the i/o latency.

\[
t_{nl} = t_{io} + t_{switch}
\tag{13.3}
\]


In Chapter 6, we identified a signalling discipline which minimized transit and chip i/o latencies. We
looked at technologies for implementing CMOS drivers and receivers for this signalling discipline
and saw how to design circuitry for realizing low-latency i/o. In Chapter 4, we developed a
simple routing protocol that was well matched to the capabilities of the implementation
technologies. The routing scheme combines simple, local decision making with a minimum
complexity routing protocol to allow the switch to perform all of its functions quickly.

To keep the contention latency and the transmission time, T_transmit, low, we looked at how
to provide high bandwidth in these networks. We saw that increasing the bandwidth available for
each connection will decrease the transmission latency.

\[
T_{transmit} = \left\lceil \frac{L}{w} \right\rceil \cdot t_c
\tag{13.4}
\]


We can increase this bandwidth either by increasing the signalling rate or by increasing the data
channel width, w. We can also note that the lower the transmission latency, T_transmit, the faster
the resources used by a connection are freed. As a result, decreasing transmission latency will also
decrease contention latency.
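
The two levers on transmission latency can be compared numerically. The sketch below assumes
that a message of L bits crosses a w-bit channel in ceil(L / w) cycles of duration t_c. The function
names and the sample numbers in the usage note are illustrative only.

```python
import math

def transmit_time(length_bits, width_bits, t_c):
    """Assumed model: ceil(L / w) channel cycles, each lasting t_c."""
    return math.ceil(length_bits / width_bits) * t_c

def bandwidth(width_bits, t_c):
    """Bits delivered per unit time on one connection."""
    return width_bits / t_c
```

For a 192-bit message with t_c = 10 ns, an 8-bit channel needs 240 ns. Doubling the width to
16 bits, or halving t_c to 5 ns, each brings this down to 120 ns. Both changes double
`bandwidth`, and both release the connection's resources sooner.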

We noticed that we can often reliably send data faster than the data can traverse wires or routing
components. As a result, we saw that pipelining the transmission of data often allows us to decrease
the signalling clock, t_c, considerably, and hence increase bandwidth, without any negative impact
on latency. To this end, we showed how the routing protocol can accommodate pipelining of data
across wires and inside routers (Section 4.11). In Chapter 6, we also saw how the i/o circuitry and
signalling discipline allow us to reliably pipeline bits across wires of arbitrary length.

Further, we saw that contention latency arises from inadequate or improperly utilized resources
inside the network. We saw in Chapter 3 that dilated routers gave connections a choice of resources
to utilize throughout the network. This freedom reduced the likelihood that blocking will occur
within the network and hence reduced contention latency. In Section 4.9.2, we saw how fast path
collapsing reduced contention latency further by quickly reclaiming resources allocated to blocked
connections.





In Chapter 3, we also saw that we have several options for reducing contention latency:

1. We can increase the per connection bandwidth by increasing the channel width or signalling
frequency, as described above.

2. We can also increase the aggregate bandwidth of the machine by increasing the number of
input and output connections between each node and the network, ni and no.

3. We can decrease the likelihood of blocking by increasing the dilation, d, so that each
connection has more options at each routing stage.

13.2 Fault Tolerance Review 

In Section 2.5, we saw that we cannot depend on the correct operation of every component in
the network if we need to achieve reasonable MTTF for large-scale multiprocessor networks. We
also saw that the ability to operate in the presence of even a small number of faulty components
improves our system reliability considerably. This observation led us to look for networks in which
we could maximize the distinct resources available to make any connections and hence to minimize
the likelihood that any set of faults will render the network dysfunctional.

In Section 2.1.1, we noted that transient faults were much more likely than permanent faults.
This fact, coupled with the single-component fault rate derived in Section 2.5, led us to be concerned
with robust operation in the face of dynamically arising faults. We found that we must devise
protocols which do not assume the correct operation of any component in the network at any point
in time. Rather, we must arrange the protocol to verify the integrity of each network operation.

In Section 3.3 we examined multipath networks and noted their potential for providing fault
tolerance. In these networks, the multiple paths between endpoints use different routing resources.
These alternate routing resources provide the basis for fault-tolerant operation. When a faulty
component renders one path inoperative, another path is available which avoids the faulty component.
In Section 3.5 we examined many of the detailed wiring issues associated with multipath,
multistage networks. We saw that the number of connections to each endpoint is the
weakest link between a node and a multipath network, and we saw how to make the best use of
the endpoint connections available in a particular network. We also visited the issue of wiring the
multiple paths inside the network to maximize fault tolerance. We saw different evaluation criteria
based on whether network connectivity is viewed as a yield problem or as a harvest problem. If we
allow node isolation, we saw that randomly-wired networks generally behave most robustly in the
face of faults. When node isolation is not permitted, we found that deterministic, maximum-fanout
networks generally survive more faults.

We also saw that we can control the amount of redundancy, and hence fault tolerance in these
networks, by selecting the router dilation, d, and the number of node input and output connections,
ni and no. We can adjust ni and no to control the mean time to node isolation. For the harvest
case, where node isolation is not allowed, increasing the number of node inputs and outputs is
probably the most effective way of increasing fault tolerance. We can adjust the dilation to control
the amount of path fanout within the network and hence the number of paths provided between
endpoints. Increasing the dilation is effective for increasing fault tolerance in both the harvest




and yield situations. However, due to the different reliability metrics we use in these two cases,
increasing the dilation is much more effective in the yield case than in the harvest case.

Multipath network topology, however, only gave us the potential for fault-tolerant operation. To
realize that potential, we noted that the routing scheme must be able to detect when failures occur and
be able to exploit the multiple paths to avoid faults. In Chapter 4, we developed such a scheme for
routing on the multistage, multipath networks detailed in Chapter 3. End-to-end message checksums
guard each data transmission against unnoticed corruption. End-to-end acknowledgments and
source-responsible retry work together to guarantee each message is delivered at least once without
corruption. Random selection of a particular path through the multipath network, coupled with
source-responsible retry, guarantees that any non-faulty path between any source-destination pair
can eventually be found. Combining these features, the routing protocol achieves correct operation
without requiring any knowledge of the faults within the network.
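
The interaction of checksums, acknowledgments, random path selection, and retry can be illustrated
with a small simulation. This is a simplified sketch, not the protocol of Chapter 4: paths are
abstracted to integer indices, a fault is modeled as a single flipped bit, and `zlib.crc32` stands in
for the end-to-end checksum. The function name and parameters are hypothetical.

```python
import random
import zlib

def send_with_retry(payload, n_paths, faulty, rng, max_trials=1000):
    """Source-responsible retry over randomly chosen paths.

    `faulty` is the set of path indices that corrupt data; the source never
    inspects it. Returns (trials_used, path_used) for the first clean delivery.
    """
    checksum = zlib.crc32(payload)
    for trial in range(1, max_trials + 1):
        path = rng.randrange(n_paths)             # random path selection
        received = payload
        if path in faulty:
            # model a fault as one flipped bit in the first byte
            received = bytes([payload[0] ^ 0x01]) + payload[1:]
        # the destination acknowledges only when the checksum matches
        if zlib.crc32(received) == checksum:
            return trial, path
    raise RuntimeError("no working path found within max_trials")
```

Even when seven of eight paths are faulty, the source needs no fault map: it keeps retrying until
the one good path is drawn and the delivery is acknowledged.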

We saw that we could minimize the performance impact of faulty components and interconnect
on the network by identifying them and masking them from the network. A known, masked fault
is deterministically avoided. This avoidance allows the random path selection to converge more
quickly on a good path by removing all known bad paths from the space of potential paths. We
also saw that identifying faults allows us to make assessments about the integrity of the network
(Section 5.1, Section 5.7).
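
The benefit of masking can be quantified under a simple model. The derivation below is an
idealization introduced here, not a result from this report: it assumes P equally likely paths
between a source and destination, f of them faulty, m of the faulty paths already masked,
independent trials, and no failures other than faults.

```latex
% q is the probability that one randomly selected, unmasked path is fault-free.
q = \frac{P - f}{P - m}, \qquad
E[\text{trials}] = \sum_{k \ge 1} k \, q \, (1 - q)^{k-1} = \frac{1}{q} = \frac{P - m}{P - f}
% Example with P = 8 and f = 2:
%   no faults masked (m = 0):   E = 8/6 \approx 1.33 trials
%   both faults masked (m = 2): E = 6/6 = 1 trial
```

Each masked fault removes one bad candidate from the selection space, so the expected number of
trials falls toward one as m approaches f.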

We developed minimally intrusive mechanisms for locating faults. The routing protocol uses
the pipeline delay cycles associated with reversing the direction of data flow across the network to
transmit router checksums and detailed connection information back to the source. This information
helps narrow down the source of any faults. Port-by-port deselection and partial-external scan
(Chapter 5) allow the system to isolate regions of the network and test for faults. Since the network
has redundant paths, portions of the network can be isolated and tested in this manner without
interfering significantly with normal operation.

Finally, we saw that the mechanisms used for fault isolation and testing, coupled with the
multiple paths within each network, provide facilities for in-operation repair. Physically replaceable
subunits can be isolated, repaired, and returned to service without taking the entire network out
of service (Section 5.6, Section 7.5.6). On-line repair allows us to minimize or eliminate system
down-time and hence maximize system availability.

13.3 Integrated Solutions 

We have described a set of techniques for building robust, low-latency multiprocessor networks.
These solutions span a range of implementation levels from VLSI circuits, packaging, and
interconnect up through architectures and organizations. Each technique presented is interesting in
its own right for the features and benefits it offers. However, the collection of techniques presented
here is most interesting because the techniques integrate smoothly into a complete system. When
assembled, we do get a system which reaps the cumulative benefits offered by all of the techniques.
The features of many of the techniques complement each other such that the overall features and
benefits of the composite system are greater than the features of the individual pieces.





A. Performance Simulations 1 


In this appendix, we will describe the simulations used to measure network performance. We will
begin by describing some features of the basic architecture modeled. We revisit some practical
issues of network construction involving the inputs and outputs of the network. Finally, we develop
methods for exercising networks based on representative network loads taken from shared memory
applications.

A.1 The Simulated Architecture

The networks simulated use a circuit-switched routing component based upon RN1 (See Chapter 8)
and METRO (See Chapter 9). The particular component used throughout these experiments
can act either as a single 8-input, radix-4, dilation-2 router, or as two independent 4-input, radix-4,
dilation-1 routers.

To aid the routing of messages, each component will have a pin dedicated to calculating
control information according to the following blocking criterion taken from Leighton and Maggs
in [LM92]. A router is blocked if it does not have at least one unused, operational output port in
each logical direction which leads to a router which is not blocked. To route a message, a router
attempts to choose a single output port, first looking at unused ports, and second eliminating any
ports which are blocked. If no unique choice arises (ports are unused, but all unblocked or all
blocked), then the router randomly decides between ports.
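
The blocking criterion and the port-selection rule can be written down directly. The following is a
minimal sketch under assumed data structures: each port is a dictionary with the boolean keys
`used`, `operational`, and `leads_to_blocked`, none of which are names from the simulator itself.
Skipping non-operational ports during selection is also an assumption of this example.

```python
import random

def is_blocked(ports_by_direction):
    """A router is blocked if some logical direction has no unused,
    operational port leading to a router that is not itself blocked."""
    for ports in ports_by_direction.values():
        usable = [p for p in ports
                  if not p["used"] and p["operational"]
                  and not p["leads_to_blocked"]]
        if not usable:
            return True
    return False

def choose_port(ports, rng):
    """Pick an output port index within one logical direction, or None."""
    unused = [i for i, p in enumerate(ports)
              if not p["used"] and p["operational"]]
    if not unused:
        return None                       # the connection blocks at this router
    unblocked = [i for i in unused if not ports[i]["leads_to_blocked"]]
    candidates = unblocked or unused      # all blocked: any unused port will do
    return rng.choice(candidates)         # random decision among equal choices
```

In a dilation-2 router each logical direction would hold two such ports, so `choose_port` decides
between at most two candidates.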

Each chip also incorporates a serial test-access port (TAP) which accesses. These ports, in turn,
are connected together in a diagnostic network which can provide in-operation diagnostics and chip
reconfiguration as detailed in Chapter 5.

A.2 Coping with Network I/O 

Much of the fault tolerance and routing behavior of our networks is dominated by the first and the
last stages. Although multipath networks provide multiple paths between any two nodes, these
paths can only use a large number of physically distinct routers towards the middle of the network.
Near the nodes, these paths must concentrate towards their sources and destinations. This
concentration is most severe in the first and last stages, where each node only has a small number
of connections to the network (See Section 3.5.2). For the sake of these simulations we assume
dilation one routing components are used in the final stage of the network (Figure 3.11).

1 This information is reprinted with slight modification from [Cho92].





A.3 Network Loading 

In this section, we derive network loads for use with our performance simulations. We need to
run a large number of simulations to obtain average performance of each network at various fault
levels. Consequently, we use simple synthetic loadings to keep simulation time manageable. We
start by using uniformly distributed random destinations for our messages. We refine our model by
looking at shared-memory applications studied by the MIT Alewife Project [CFKA90].

A.3.1 Modeling Shared-Memory Applications 

Our goal is to provide a realistic model of network utilization that can be used to compare
many different networks and parameters. To keep simulation time tractable, we use a variant of
uniform traffic. Our simulation sends messages to random destinations in a uniform distribution.
However, message lengths are randomly generated according to distributions derived from specific
parallel applications. These applications were taken from caching studies done by the Alewife
Project [CFKA90]. These studies simulate a shared memory architecture with coherent caches at
each processing node. Data taken from this study corresponds to the following system parameters:

• Shared memory, coherent caches

• Full-map directories

• 16-byte cache lines

• 64 nodes, corresponding to 3-stage, radix-4 networks

• Single thread

• CISC instructions

• 1 memory reference per instruction

• Processors stall after 1 outstanding memory reference

• Barrier synchronization

A.3.2 Application Descriptions 

[CFKA90] studied four applications: SIMPLE, SPEECH, FFT, and WEATHER. SIMPLE models
the hydrodynamic behavior of fluids using finite difference methods to solve the equations in
two dimensions. SPEECH is the lexical decoding stage of a phonetically-based spoken language
understanding system. It uses a variant of the Viterbi search algorithm. FFT is a radix-2 Fast Fourier
Transform. WEATHER uses finite-difference methods to solve partial differential equations which
model the atmosphere around the globe.

These applications attempt to represent three major classes of problems: graph problems,
continuum problems, and particle problems. Graph problems involve searching and irregular
communication. Continuum problems generally have localized communication in regular patterns.
Particle problems often involve communication over long distances to simulate interactions such
as those due to gravitational forces.





A.3.3 Application Data 

Cache-coherent shared memory systems utilize a small set of message types, each of a fixed
length. The messages sent through the network read cache lines, write cache lines, and maintain
cache coherency. A cache-line read, for example, requires an 8-byte read request followed by a
reply containing the data. The reply consists of an 8-byte header followed by 16 bytes representing
the desired cache line.

Table A.1 lists the frequency of all transactions for our four applications. The messages sent
for each transaction are also listed. In contrast to the Alewife study, it is important for us to
distinguish which transactions are split phase. Our networks are circuit-switched and can save
routing time if a reply to a request is immediately available. However, if the transaction must be
split into two phases, two messages will have to be routed separately. Table A.1 distinguishes those
messages which are single phase, always split phase, and sometimes split phase. Table A.2 gives
the percentage split phase for those which are sometimes split phase. Assuming two router cycles
per processor cycle, our data gives us a 3, 6, 7, and 9 percent approximate message generation rate
per router cycle for WEATHER, SIMPLE, SPEECH, and FFT, respectively.

Table A.2 also gives the approximate grain sizes of each application. These will be
used to determine frequency of barrier synchronization, to be discussed in Section A.3.4. Finally,
Table A.3 gives the relative frequency of each length of message for each application. This is
summarized by the average length of messages given for each application.
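The summary figures can be checked against the per-length frequencies. The Python sketch below recomputes the average message length and the approximate message generation rate from the entries of Table A.3. It rests on two assumptions of ours: the table entries are read as messages per 100 processor cycles, and the WEATHER 8-byte entry is taken as 2.0211, the value consistent with the printed WEATHER average of 21.2178.

```python
# Relative message frequencies by length in bytes, transcribed from Table A.3.
# Assumption: entries are messages per 100 processor cycles.
# Assumption: the WEATHER 8-byte entry is 2.0211 (consistent with the
# printed average length of 21.2178).
TABLE_A3 = {
    "WEATHER": {8: 2.0211, 16: 0.5112, 24: 1.1207, 32: 2.4359},
    "SIMPLE":  {8: 2.0798, 16: 1.2936, 24: 1.7055, 32: 5.9295},
    "SPEECH":  {8: 5.5800, 16: 2.7455, 24: 1.8890, 32: 3.3813},
    "FFT":     {8: 5.3179, 16: 2.9109, 24: 3.5784, 32: 5.6832},
}
ROUTER_CYCLES_PER_PROCESSOR_CYCLE = 2


def average_length(freqs):
    """Frequency-weighted mean message length in bytes."""
    total = sum(freqs.values())
    return sum(length * f for length, f in freqs.items()) / total


def percent_per_router_cycle(freqs):
    """Messages generated per 100 router cycles."""
    return sum(freqs.values()) / ROUTER_CYCLES_PER_PROCESSOR_CYCLE


for app, freqs in TABLE_A3.items():
    print(app, round(average_length(freqs), 4),
          round(percent_per_router_cycle(freqs), 2))
```

Rounded to whole percentages, the computed rates reproduce the 3, 6, 7, and 9 percent quoted for WEATHER, SIMPLE, SPEECH, and FFT.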

A.3.4 Synchronization 

Our performance simulation includes barrier synchronization to eliminate skew. Our simulation
models applications which assume some degree of synchronization between the multiple processor
nodes of the system. When a simulation violates this assumption, results are skewed. We prevent
this skew by performing periodic barrier synchronization according to grain sizes estimated for
each application.

It is important to eliminate simulation skew because it can mask the effects of localized
network degradation. Analytic models suggest that faults and congestion may severely affect the
performance observed by specific destination nodes while leaving others largely unaffected [KR89].
Without synchronization, such localized degradation would be lost in the I/O bandwidth
utilization. Modeling synchronization, however, forces all processors to wait for those falling
behind, resulting in a more realistic decrease in I/O bandwidth utilization.
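The masking effect can be seen in a toy model, sketched here in Python. This is not the simulator used for the results in this appendix; the per-node cost model, the function names, and the example numbers are all hypothetical. Each node is assigned a cost in cycles per message, with 1.0 taken as the fault-free cost.

```python
def utilization_without_barrier(costs, ideal_cost=1.0):
    """Average per-node utilization when every node runs at its own pace.

    costs: cycles per message observed by each node (hypothetical model).
    """
    return sum(ideal_cost / c for c in costs) / len(costs)


def utilization_with_barrier(costs, ideal_cost=1.0):
    """Utilization when every node must wait at a barrier for the slowest node."""
    return ideal_cost / max(costs)


# Hypothetical example: 64 nodes, two of which see 4x slower communication.
costs = [1.0] * 62 + [4.0] * 2
print(utilization_without_barrier(costs))
print(utilization_with_barrier(costs))
```

In this example the unsynchronized average stays near full utilization, while the barrier-synchronized figure drops to that of the slowest nodes, which is the effect the periodic barriers are meant to expose.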

A.3.5 The FLAT24 Load

We also simulate a uniform message distribution, FLAT24. FLAT24 uses 24-byte messages
and other simulation parameters which are similar to the messages and parameters of
SPEECH. FLAT24 serves as a basis for network comparison.

For our later studies, we shall also use FLAT24, but we shall use parameters which differ slightly
from those in the Alewife study and correspond more closely to our target architectures.

To derive the frequency of message loading, several reasonable parameter values were chosen.
Note that our results are not overly sensitive to loading, so rough figures are adequate. The program
code for each processor is assumed to be resident in local memory. Consequently, only data





Message Type                    Messages                      WEATHER      SIMPLE   SPEECH   FFT
read miss, write mode           [hdr,hdr+data] (hdr,hdr+...   [illegible]  0.4500   1.8700   0.9600
read miss, not write mode       (hdr,hdr+data)                0.8400       4.2400   1.3500   1.7500
read miss, not in any cache     (hdr,hdr+data)                0.6700       0.7700   0.0800   0.1300
write miss, write mode          [hdr,hdr+data] (hdr,hdr+...   [illegible]  0.6300   0.0100   2.5100
write miss, not write mode      ?hdr,hdr+data?                [illegible]  0.1900   0.0000   0.1000
write hit, not write mode       ?hdr,hdr?                     0.5500       0.4500   1.8500   0.9900
write miss, not in any cache    (hdr,hdr+data)                [illegible]  0.3000   0.1500   0.0000
instruction miss                (hdr,hdr+data)                0.0700       0.1900   0.0000   0.2700
private misses                  (hdr,hdr+data)                0.1159       0.1038   0.0013   0.1821
invalidations                   (hdr,hdr)                     0.4318       1.2562   2.7455   2.8004
evictions                       (hdr,hdr)                     0.0000       0.0000   0.0000   0.0000
replacements of dirty data      (hdr+data)                    0.0407       0.4512   0.0090   0.0196
synchronizations not cache...   (hdr,hdr+data)                0.0000       0.0000   0.0000   0.0000

hdr = packet header, data = cache line
[...] = split phase message combination
(...) = single phase
?...? = sometimes split phase

Relative transaction frequencies for each of our four applications, in transactions per processor
cycle, are given above. Messages sent for each kind of transaction are also given.
(Data Courtesy of David Chaiken)

Table A.1: Relative Transaction Frequencies for Shared-Memory Applications


                       WEATHER          SIMPLE    SPEECH     FFT
percent split phase    85.5560          91.7210   100.0000   88.8396
grain size             59769, 187714    7271      10000      28289

For each application, the percent split phase is given for those listed as sometimes split
phase (?...?) in Table A.1. Approximate grain sizes are also given for each application.
There are two grain sizes for WEATHER, one for each of two phases in the application.

Table A.2: Split Phase Transactions and Grain Sizes for Shared-Memory Applications



                  WEATHER   SIMPLE    SPEECH    FFT
8-byte            2.0211    2.0798    5.5800    5.3179
16-byte           0.5112    1.2936    2.7455    2.9109
24-byte           1.1207    1.7055    1.8890    3.5784
32-byte           2.4359    5.9295    3.3813    5.6832
Average Length    21.2178   24.3463   17.8074   20.4033

Relative frequencies of each length of message are given. These are summarized by the
average length of messages for each application.

Table A.3: Message Lengths for Shared-Memory Applications


references will result in non-local memory references. We assumed a data cache miss rate of 15
percent. For each data read or write, we get 0.15 misses per processor cycle. We assumed that 50
percent of the references are to local memory, which gives us 0.075 references to non-local memory
per processor cycle. With a 50 MHz processor and the RN1 part running at better than 100 MHz,
we have two router cycles per processor cycle. This gives us 0.0375 non-local memory references
per router cycle. Adding an additional 10 percent to account for cache coherency messages, we
end up with approximately 0.04 messages per router cycle.
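The arithmetic in this derivation can be restated as a short Python sketch. The function and parameter names are ours; the default values are the ones assumed in the text.

```python
def messages_per_router_cycle(miss_rate=0.15,
                              local_fraction=0.50,
                              router_cycles_per_processor_cycle=2,
                              coherency_overhead=0.10):
    """Approximate FLAT24 message generation rate per router cycle."""
    # Misses that must leave the node, per processor cycle (0.075).
    nonlocal_per_processor_cycle = miss_rate * (1.0 - local_fraction)
    # Same traffic expressed per router cycle (0.0375).
    nonlocal_per_router_cycle = (nonlocal_per_processor_cycle
                                 / router_cycles_per_processor_cycle)
    # Add the allowance for cache coherency messages.
    return nonlocal_per_router_cycle * (1.0 + coherency_overhead)


rate = messages_per_router_cycle()
print(round(rate, 5), "messages per router cycle")
```

The unrounded result is 0.04125, which the text rounds to 0.04. Changing any one parameter shows how directly the loading depends on the assumed miss rate and locality.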

We also examine time between synchronizations, or, equivalently, application grain size. A
grain size representative of applications studied is 10,000 cycles, or 10,000 x 0.04 = 400 messages.
Altogether, our performance metric is the time to route the following task: all processors in the
system must each send 400 24-byte messages at a rate of 0.08 messages per active processor cycle.
We assume that each processor can have up to 4 threads, or tasks, each with an outstanding message,
before stalling.

Note that our task explicitly models barrier synchronization. Other styles of synchronization
may involve smaller processor groups, but may also tend to synchronize these groups more often.
In any case, it is important to model the synchronization requirements of an application. For our
purposes, barrier synchronization incorporates an appropriate component of these requirements
into our performance metric. We see that synchronization plays a major role in performance
degradation. This degradation occurs when network failure results in a small number of nodes with
particularly poor communication bandwidth.

Let us take one more look at our numbers. Since messages are 24 bytes long, we are basically
running our network at 1 byte per router cycle, or 100 percent. If the message rate were any
higher, the processors would just be stalled more often, and the network loading would not really
change. Note that our analysis assumes low-latency message handling, a concept demonstrated in
the J-Machine [D+92]. If, as with many commercial and research machines, there exists a high
latency for message handling, the latency induces a feedback effect which prevents full utilization
of the network [Joh92]. Although cache miss and message locality numbers are open to debate, we
observe that the technological trends of multiple-issue processors and wider cache lines will only
increase demands on the network. However, performance results presented in this paper were also
verified to be qualitatively unchanged under network loading half of that used here.
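The FLAT24 task parameters and the resulting offered load follow from a few multiplications, sketched below in Python. The variable names are ours. The grain size is taken in router cycles, which is our reading of the 10,000 x 0.04 product, and the 100 percent figure assumes a peak of one byte per router cycle, as the text implies.

```python
MESSAGE_BYTES = 24                 # FLAT24 message length
MESSAGES_PER_ROUTER_CYCLE = 0.04   # approximate rate derived in Section A.3.5
ROUTER_CYCLES_PER_PROCESSOR_CYCLE = 2
GRAIN_CYCLES = 10_000              # assumed here to be router cycles

# Messages each processor sends between barrier synchronizations.
messages_per_grain = round(GRAIN_CYCLES * MESSAGES_PER_ROUTER_CYCLE)

# The same rate expressed per processor cycle.
messages_per_processor_cycle = (MESSAGES_PER_ROUTER_CYCLE
                                * ROUTER_CYCLES_PER_PROCESSOR_CYCLE)

# Offered load in bytes per router cycle; about 1.0 means about 100 percent
# under the one-byte-per-cycle peak assumed above.
offered_bytes_per_router_cycle = MESSAGE_BYTES * MESSAGES_PER_ROUTER_CYCLE

print(messages_per_grain, messages_per_processor_cycle,
      round(offered_bytes_per_router_cycle, 2))
```

The offered load works out to 0.96 bytes per router cycle, which the text rounds to 1 byte per router cycle.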

A.4 Performance Results for Applications 

In this section we summarize our performance results for each network in the presence of
faults. Performance was measured for complete networks only. For each fault level shown,
multiple trials were run in which faults were randomly chosen. After fault insertion, applications
were simulated on those networks which remained complete. Data shown represent the average I/O
bandwidth utilization and latencies over those trials involving complete networks. These results do
not represent the actual performance of these applications on hardware. Rather, our data provides
a basis for network comparison by illustrating performance trends in the presence of faults.
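The measurement procedure described above can be outlined in Python as follows. This is a minimal sketch under our own assumptions, not the simulator used for these results: `is_complete` and `simulate` are hypothetical callables standing in for the completeness check and the application simulation, and faults are drawn uniformly over an abstract set of numbered components.

```python
import random


def mean_over_complete_trials(num_components, num_faults,
                              is_complete, simulate, trials, seed=0):
    """Average a metric over random fault trials that leave the network complete.

    is_complete(faulty) and simulate(faulty) are hypothetical callables that
    take a frozenset of faulty component indices.
    Returns (mean_metric or None, number_of_complete_trials).
    """
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        # Choose the faulty components for this trial at random.
        faulty = frozenset(rng.sample(range(num_components), num_faults))
        # Only networks that remain complete are simulated and averaged.
        if is_complete(faulty):
            results.append(simulate(faulty))
    mean = sum(results) / len(results) if results else None
    return mean, len(results)
```

Returning the number of complete trials alongside the mean matters when reading the results: the average describes only the surviving networks, and that population shrinks as the fault level rises.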

To isolate the effects of interwiring, the deterministic and random networks presented use
dilation-2 components in the last stage. We studied 3-stage non-interwired, randomly-interwired,
and deterministically-interwired networks. I/O bandwidth utilization varied by less than 2 percent,
a variation not significant for our simulation accuracy.

However, to avoid single-point disconnections in the last stage, the interwired networks we
shall analyze for fault performance are constructed from dilation-1 components in the last stage.
Although dilated components provide better performance, the use of these dilation-1 components
substantially increases fault tolerance (See Section 3.5.2). For applications studied on 3-stage
networks, the decrease in I/O bandwidth utilization was less than 6 percent.

Figures A.1 and A.2 detail the fault performance of our applications on 3-stage, radix-4
networks which can withstand faults. The FLAT24 application, shown as a solid line, is representative
of graph trends and will be used in network comparisons.

Figures A.3 and A.4 compare the performance of those networks which can tolerate faults.
The performance of the random network is slightly better than that of the deterministic network.
However, recall that figures are for complete networks only. For each fault level shown, the
random networks have a lower probability of remaining complete than the deterministic networks.


[Figure: two panels, "Random Network I/O" and "Random Network Latencies", each plotted against Percent Failure (axis ticks 0.0, 4.2, 8.3, 12.5, 16.6).]

I/O bandwidth utilization and latencies for applications on 3-stage random networks. The
application FLAT24, shown as a solid line, is representative of graph trends and will be used
for network comparison.

Figure A.1: Applications on 3-stage Random Networks


[Figure: two panels, "Deterministic Network I/O" and "Deterministic Network Latencies", each plotted against Percent Failure.]

I/O bandwidth utilization and latencies for applications on 3-stage deterministic networks.
The application FLAT24, shown as a solid line, is representative of graph trends and will be
used for network comparison.

Figure A.2: Applications on the 3-stage Deterministic Network


[Figure: two panels, "3-Stage Network I/O" and "3-Stage Network Latencies", each plotted against Percent Failure.]

Comparative I/O bandwidth utilization and latencies for 3-stage deterministic and random
networks on FLAT24. Recall from Table 3.4 that expected percentages of failure tolerated
by random and deterministic networks are, respectively: 10% and 16%. Note that the
performance degradation appears to level off because only complete networks are measured.
Although the surviving networks suffer less degradation as percentage of failure increases,
the number of surviving networks is becoming substantially smaller.

Figure A.3: Comparative Performance of 3-Stage Networks


[Figure: two panels, "4-Stage Network I/O" and "4-Stage Network Latencies", each plotted against Percent Failure.]

Comparative I/O bandwidth utilization and latencies for 4-stage random and deterministic
networks on FLAT24. Recall from Table 3.4 that expected percentages of failure tolerated
by random and deterministic networks are, respectively: 4.6% and 8.8%. Note that the
performance degradation appears to level off because only complete networks are measured.
Although the surviving networks suffer less degradation as percentage of failure increases,
the number of surviving networks is becoming substantially smaller.

Figure A.4: Comparative Performance of 4-Stage Networks





Bibliography 


[ACD+91] Anant Agarwal, David Chaiken, Godfrey D'Souza, Kirk Johnson, David Kranz, John Kubiatowicz, Kiyoshi Kurihara, Beng-Hong Lim, Gino Maa, Dan Nussbaum, Mike Parkin, and Donald Yeung. The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor. MIT/LCS/TM 454, MIT, 545 Technology Sq., Cambridge, MA 02139, November 1991.

[ACM88] Arvind, D. E. Culler, and G. K. Maa. Assessing the Benefits of Fine-Grain Parallelism in Dataflow Programs. The International Journal of Supercomputer Applications, 2(3), November 1988.

[AI87] Arvind and R. A. Iannucci. Two Fundamental Issues in Multiprocessing. In Proceedings of DFVLR Conference on Parallel Processing in Science and Engineering, pages 61-88, West Germany, June 1987.

[ALKK90] Anant Agarwal, Beng-Hong Lim, David Kranz, and John Kubiatowicz. APRIL: A Processor Architecture for Multiprocessing. In Proceedings of the 17th International Symposium on Computer Architecture, pages 104-114. IEEE, May 1990.

[And85] T. Anderson, editor. Resilient Computing Systems, volume I, chapter 10, pages 178-196. Wiley-Interscience, 1985. Chapter by: C. I. Dimmer.

[Baz85] M. Bazes. A Novel Precision MOS Synchronous Delay Line. IEEE Journal of Solid-State Circuits, 20(6):1265-1271, December 1985.

[BJN+86] L. A. Bergman, A. R. Johnston, R. Nixon, S. C. Esener, C. C. Guest, P. K. Yu, T. J. Drabik, M. R. Feldman, and S. H. Lee. Holographic Optical Interconnects for VLSI. Optical Engineering, 25(10):1009-1118, October 1986.

[Bra90] Christopher W. Branson. Integrating Tester Pin Electronics. IEEE Design and Test of Computers, pages 4-14, April 1990.

[Buc89] Leonard Buchoff. Elastomeric Connectors for Land Grid Array Packages. Connection Technology, April 1989.

[CCS+88] Barbara A. Chappell, Terry I. Chappell, Stanley E. Schuster, Hermann M. Segmuller, James W. Allan, Robert L. Franch, and Phillip J. Restle. Fast CMOS ECL Receivers With 100-mV Worst-Case Sensitivity. IEEE Journal of Solid-State Circuits, 23(1):59-67, February 1988.



[CED92] Frederic Chong, Eran Egozy, and Andre DeHon. Fault Tolerance and Performance of Multipath Multistage Interconnection Networks. In Thomas F. Knight Jr. and John Savage, editors, Advanced Research in VLSI and Parallel Systems 1992, pages 227-242. MIT Press, March 1992.

[CFKA90] David Chaiken, Craig Fields, Kiyoshi Kurihara, and Anant Agarwal. Directory-Based Cache-Coherence in Large-Scale Multiprocessors. IEEE Computer, 23(6):41-58, June 1990.

[Cho92] Frederic T. Chong. Performance Issues of Fault Tolerant Multistage Interconnection Networks. Master's thesis, MIT, 545 Technology Sq., Cambridge, MA 02139, May 1992.

[CK92] Frederic T. Chong and Thomas F. Knight, Jr. Design and Performance of Multipath MIN Architectures. In Symposium on Parallel Architectures and Algorithms, pages 286-295, San Diego, California, June 1992. ACM.

[Com90] IEEE Standards Committee. IEEE Standard Test Access Port and Boundary-Scan Architecture. IEEE, 345 East 47th Street, New York, NY 10017-2394, July 1990. IEEE Std 1149.1-1990.

[Cor89] Engineering Plastics Division Hoechst Celanese Corporation. Vectra Liquid Crystal Polymer Molding Guidelines and Specifications (VC-6), 1989.

[Cor90] Rogers Corporation. Isocon Interconnection Features and Applications, 1990.

[Cor91] Intel Corporation. Paragon XP/S Product Overview, 1991.

[CSS+91] David E. Culler, Anurag Sah, Klaus Erik Schauser, Thorsten von Eicken, and John Wawrzynek. Fine-grain Parallelism with Minimal Hardware Support: A Compiler-Controlled Threaded Abstract Machine. In Proceedings of the Fourth International Conference on the Architectural Support for Programming Languages and Operating Systems, April 1991.

[CYH84] Chin Chi-Yuan and Kai Hwang. Connection Principles for Multipath Packet Switching Networks. In 10th Annual Symposium on Computer Architecture, pages 99-108, 1984.

[D+92] William J. Dally et al. The Message-Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms. IEEE Micro, pages 23-39, April 1992.

[Dal87] William J. Dally. A VLSI Architecture for Concurrent Data Structures. Kluwer Academic Publishers, 1987.

[Dal91] William J. Dally. Express Cubes: Improving the performance of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 40(9):1016-1023, September 1991.

[DeH90] Andre DeHon. Fat-Tree Routing For Transit. AI Technical Report 1224, MIT Artificial Intelligence Laboratory, April 1990.



[DeH91] Andre DeHon. Practical Schemes for Fat-Tree Network Construction. In Carlo H. Sequin, editor, Advanced Research in VLSI: International Conference 1991. MIT Press, March 1991.

[DeH92] Andre DeHon. Scan-Based Testability for Fault-tolerant Architectures. In Duncan M. Walker and Fabrizio Lombardi, editors, Proceedings of the IEEE International Workshop on Defect and Fault Tolerance in VLSI Systems, pages 90-99. IEEE, IEEE Computer Society Press, 1992.

[Div89] Cinch Connector Division. CIN::APSE Product Brochure, 1989.

[DKS93] Andre DeHon, Thomas F. Knight Jr., and Thomas Simon. Automatic Impedance Control. In ISSCC Digest of Technical Papers, pages 164-165. IEEE, February 1993.

[DS86] William J. Dally and Charles L. Seitz. The Torus Routing Chip, volume 1 of Distributed Computing, pages 187-196. Springer-Verlag, 1986.

[DZ83] J. D. Day and J. Zimmermann. The OSI Reference Model. IEEE Transactions on Communications, 71:1334-1340, December 1983.

[E+92] Thorsten von Eicken et al. Active Messages: a Mechanism for Integrated Communication and Computation. In Proceedings of the 19th Annual Symposium on Computer Architecture, Queensland, Australia, May 1992.

[Fuj92] Fujipoly. Connector W Series, 1992.

[GC82] George C. Clark, Jr. and J. B. Cain. Error-Correction Coding for Digital Communications. Plenum Press, New York, 1982.

[GK92] Thaddeus J. Gabara and Scott C. Knauer. Digitally Adjustable Resistors in CMOS for High-Performance Applications. IEEE Journal of Solid-State Circuits, 27(8):1176-1185, August 1992.

[GL85] Ronald I. Greenberg and Charles E. Leiserson. Randomized Routing on Fat-Trees. In IEEE 26th Annual Symposium on the Foundations of Computer Science. IEEE, November 1985.

[GM82] P. Goel and M. T. McMahon. Electronic Chip-In-Place Test. In Proceedings 1982 International Test Conference, pages 83-90. IEEE, 1982.

[GPM92] Dimitry Grabbe, Ryszard Prypotniewicz, and Henri Merkelo. AMPSTAR Connector Family, 1992.

[HLwn] Johan Hastad and Tom Leighton. Analysis of Backoff Protocols for Multiple Access Channels. In Unknown. Unknown, Unknown.

[Hui90] Joseph Y. Hui. Switching and Traffic Theory for Integrated Broadband Networks. Kluwer Academic Publishers, Boston, 1990.



[Inc90] AMP Incorporated. Conductive Carbon Fiber Connectors Product Announcement, July 1990. Catalog 90-749.

[Int89] Intel Corporation, Literature Sales, P.O. Box 58130, Santa Clara, CA 95052-8130. 80960CA User's Manual, 1989.

[Joh88] Mark G. Johnson. A Variable Delay Line PLL for CPU-Coprocessor Synchronization. IEEE Journal of Solid-State Circuits, 23(5):1218-1223, October 1988.

[Joh92] Kirk L. Johnson. The Impact of Communication Locality on Large-Scale Multiprocessor Performance. In Proceedings of the 19th Annual International Symposium on Computer Architecture, Queensland, Australia, May 1992.

[Jor83] J. F. Jordan. Performance Measurement on HEP - A pipelined MIMD Computer. In Proceedings of the 19th Annual International Symposium on Computer Architecture. IEEE, June 1983.

[Kah91] Nabil Kahale. Better Expansion for Ramanujan Graphs. In 32nd Annual Symposium on Foundations of Computer Science, pages 398-404. IEEE, October 1991.

[KK88] Thomas F. Knight Jr. and Alexander Krymm. A Self-Terminating Low-Voltage Swing CMOS Output Driver. IEEE Journal of Solid-State Circuits, 23(2), April 1988.

[KLS90] Kathleen Knobe, Joan D. Lukas, and Guy L. Steele, Jr. Data Optimization: Allocation of Arrays to Reduce Communication on SIMD Machines. Journal of Parallel and Distributed Computing, 8:102-118, 1990.

[KMZ79] Bernd Konemann, Joachim Mucha, and Günther Zwiehoff. Built-In Logic Block Observation Techniques. In Proceedings 1979 International Test Conference, pages 37-41. IEEE, 1979.

[KR89] Vijay P. Kumar and Andrew L. Reibman. Failure Dependent Performance Analysis of a Fault-Tolerant Multistage Interconnection Network. IEEE Transactions on Computers, 38(12):1703-1713, December 1989.

[KS86] Clyde P. Kruskal and Marc Snir. A Unified Theory of Interconnection Network Structure. In Theoretical Computer Science, pages 75-94, 1986.

[Lak86] Ron Lake. A Fast 20K Gate Array with On-Chip Test System. VLSI Systems Design, 7(6):46-55, June 1986.

[LeB84] Johnny J. LeBlanc. LOCST: A Built-In Self-Test Technique. IEEE Design and Test, pages 45-52, November 1984.

[Lei85] Charles E. Leiserson. Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Transactions on Computers, C-34(10):892-901, October 1985.

[Lei89] Charles E. Leiserson. VLSI Theory and Parallel Supercomputing. MIT/LCS/TM 402, MIT, 545 Technology Sq., Cambridge, MA 02139, May 1989. Also appears as an invited presentation at the 1989 Caltech Decennial VLSI Conference.



[LHM+89] Anthony L. Lentine, H. Scott Hinton, David A. B. Miller, Jill E. Henry, J. E. Cunningham, and Leo M. F. Chirovsky. Symmetric Self-Electrooptic Effect Device: Optical Set-Reset Latch, Differential Logic Gate, and Differential Modulator/Detector. IEEE Journal of Quantum Electronics, 25(8):1928-1936, August 1989.

[LLG+91] Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, and John Hennessy. Overview and Status of the Stanford DASH Multiprocessor. In Norihisa Suzuki, editor, Proceedings of the International Symposium on Shared Memory Multiprocessing, pages 102-108. Information Processing Society of Japan, April 1991.

[LM89] Tom Leighton and Bruce Maggs. Expanders Might Be Practical: Fast Algorithms for Routing Around Faults on Multibutterflies. In IEEE 30th Annual Symposium on Foundations of Computer Science, 1989.

[LM92] Tom Leighton and Bruce Maggs. Fast Algorithms for Routing Around Faults in Multibutterflies and Randomly-Wired Splitter Networks. IEEE Transactions on Computers, 41(5):1-10, May 1992.

[LP83] Duncan Lawrie and Krishnan Padmanabhan. A Class of Redundant Path Multistage Interconnection Networks. IEEE Tr. on Computers, C-32(12):1099-1108, December 1983.

[LR86] Frank T. Leighton and Arnold L. Rosenberg. Three-Dimensional Circuit Layouts. SIAM Journal on Computing, 15(3):793-813, August 1986.

[Mal91] David Maliniak. Pinless Land-Grid Array Socket Houses Intel 80386SL Microprocessor. Electronic Design, March 28 1991.

[MB88] Glenford J. Myers and David L. Budde. The 80960 Microprocessor Architecture. Wiley-Interscience, 1988.

[MDK91] Henry Minsky, Andre DeHon, and Thomas F. Knight Jr. RN1: Low-Latency, Dilated, Crossbar Router. In Hot Chips Symposium III, 1991.

[Mil91] David Miller. International Trends in Optics, chapter 2, pages 13-23. Academic Press, Inc., 1991.

[Min91] Henry Q. Minsky. A Parallel Crossbar Routing Chip for a Shared Memory Multiprocessor. AI memo 1284, MIT Artificial Intelligence Laboratory, 545 Technology Sq., Cambridge, MA 02139, 1991.

[MR91] Silvio Micali and Phillip Rogaway. Secure Computation. MIT/LCS/TR 513, MIT, 545 Technology Sq., Cambridge, MA 02139, August 1991.

[MT90] Colin M. Maunder and Rodham E. Tulloss, editors. The Test Access Port and Boundary-Scan Architecture, chapter 20, pages 205-213. IEEE, 1990. Chapter by: Patrick F. McHugh and Lee Whetsel.



[ND90] Michael Noakes and William J. Dally. System Design of the J-Machine. In William J. Dally, editor, Advanced Research in VLSI: Proceedings of the Sixth MIT Conference, pages 179-194, Cambridge, Massachusetts, 1990. MIT Press.

[NPA91] R. S. Nikhil, G. M. Papadopoulos, and Arvind. *T: A Killer Micro for A Brave New World. CSG Memo 325, MIT, 545 Technology Sq., Cambridge, MA 02139, 1991.

[NPA92] Rishiyur S. Nikhil, Gregory M. Papadopoulos, and Arvind. *T: A Multithreaded Massively Parallel Architecture. In Proceedings of the 19th International Symposium on Computer Architecture. ACM, May 1992.

[oD86] U.S. Department of Defense. Military Standardization Handbook: Reliability Prediction of Electronic Equipment, MIL-HDBK-217E edition, 1986. This handbook is updated periodically to track the evolution of electronic equipment technology. MIL-HDBK-217 was first issued in 1965.

[PC90] G. M. Papadopoulos and D. E. Culler. Monsoon: an Explicit Token-Store Architecture. In Proceedings of the 17th Annual International Symposium on Computer Architecture. IEEE, 1990.

[Pol90] Shin Etsu Poly. Shin-Flex GD Product Summary, 1990.

[Pos81] Jon Postel (Ed.). Transmission Control Protocol - DARPA Internet Program Protocol Specification. RFC 793, USC/ISI Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, California, 90291, September 1981.

[PW72] W. Wesley Peterson and E. J. Weldon Jr. Error-Correcting Codes. MIT Press, Cambridge, MA, 1972.

[RWD84] Simon Ramo, John R. Whinnery, and Theodore Van Duzer. Fields and Waves in Communication Electronics. John Wiley and Sons, 2nd edition, 1984.

[SBCvE90] R. Saavedra-Barrera, D. E. Culler, and T. von Eicken. Analysis of Multithreaded Architectures for Parallel Computing. In Proceedings of the 2nd Annual Symposium on Parallel Algorithms and Architectures, July 1990.

[Sim92] Thomas Simon. Matched Delay Clock Distribution. Personal Communications, 1992.

[Smi78] B. J. Smith. A Pipelined, Shared-Resource MIMD Computer. In Proceedings of the 1978 International Conference on Parallel Processing, pages 6-8, 1978.

[Smo85] R. Smolley. Button Board, A New Technology Interconnect for 2 and 3 Dimensional Packaging. In International Society for Hybrid Microelectronics Conference, November 1985.

[SS92] Daniel P. Siewiorek and Robert S. Swarz. Reliable Computer Systems: Design and Evaluation. Digital Press, Burlington, MA, 2nd edition, 1992.



[Tec88] Elastomeric Technologies. Elastomeric Connector Application Guide, 1988.

[Tex91] Texas Instruments. Shrink Small Outline Package Surface Mounting Packages, 1991.

[Thi91] Thinking Machines Corporation, Cambridge, MA. CM5 Technical Summary, October 1991.

[Tho80] C. D. Thompson. A Complexity Theory of VLSI. Technical Report CMU-CS-80-140, Department of Computer Science, Carnegie-Mellon University, 1980.

[Upf89] E. Upfal. An O(log N) deterministic packet routing scheme. In 21st Annual ACM Symposium on Theory of Computing, pages 241-250. ACM, May 1989.

[Wag87] Paul T. Wagner. Interconnect Testing with Boundary Scan. In Proceedings 1987 International Test Conference, pages 52-57. IEEE, 1987.

[Wal92] Deborah Wallach. PHD: A Hierarchical Cache Coherent Protocol. Master's thesis, MIT, 545 Technology Sq., Cambridge, MA 02139, September 1992.

[WBJ+87] W. H. Wu, L. A. Bergman, A. R. Johnston, C. C. Guest, S. C. Esener, P. K. Yu, M. R. Feldman, and S. H. Lee. Implementation of Optical Interconnections for VLSI. IEEE Transactions on Electron Devices, 34(3):706-714, March 1987.

[Web90] Steve Webber. The Stratus Architecture. Stratus Technical Report TR-1, Stratus Computer, Inc., 55 Fairbanks Blvd., Marlboro, Massachusetts 01752, 1990.



